ABSTRACT


DASH is an experimental distributed system intended for large high-performance networks
connecting heterogeneous and mutually distrustful hosts. The DASH network
communication  architecture  is  based  on  the  Parameterized  Message  Channel  abstraction. 
It  also  includes  a  subtransport  layer  that  incorporates  many  host-to-host  functions,  and  a 
remote  operation  facility  that  provides  request/reply  communication.  At  higher  levels,  a 
service  access  mechanism  and  a  global  naming  system  together  provide  uniform  global 
access  to  logical  resources. 

This  report  describes  and  discusses  the  DASH  network  communication  architecture,  and 
the  DASH  kernel  implementation  of  this  architecture. 

1. INTRODUCTION

DASH is an experimental distributed system intended for large, high-performance networks;
an overview is given in [3]. DASH includes a) an architecture for network communication,
and b) a portable operating system kernel that implements this architecture.
This  report  describes  the  network  communication  architecture  and  its  implementation  in 
the  DASH  kernel.  [24, 25]  describe  other  aspects  of  the  DASH  kernel. 

The  design  of  DASH  is  guided  by  the  likely  hardware  advances  of  the  next  5  to  10  years: 

•  Wide-area  networks  with  low  delay  (30  milliseconds  coast-to-coast)  and  high 
bandwidth  (1  gigabit/second). 

• Shared-memory multiprocessor workstations with 10 to 100 MIPS processing capacity.

•  Large  memory  sizes,  ranging  from  gigabyte  workstation  main  memories  to  terabyte 
mass  storage  devices. 

These  advances  will  create  the  possibility  of  a  new  type  of  distributed  system  we  call  a 
very  large  distributed  system  (VLDS),  spanning  a  large  number  (thousands  or  millions) 
of hosts, owned by many organizations and by individuals. A VLDS will support a range
of  new  applications,  such  as: 

•  Very  large  scale  distributed  parallel  computation:  the  set  of  processors  on  a  VLDS 
will  provide  a  “processor  bank”  numbering  perhaps  in  the  millions,  and  capable  of 
supporting  parallel  computation  at  many  levels  of  granularity. 

• Very large scale communication applications: a VLDS will allow a variety of communication
applications to be integrated into a single system: commercial applications
(advertising, sales, and remote banking), interpersonal communication (mail,
telephone, facsimile, and video conferencing), and distribution of digital audio and
video entertainment and news.

•  High-bandwidth  interfaces  to  distant  resources:  many  network  services  will  offer 
graphics/audio-based  interfaces.  Such  services  will  use  data  pipelining  and  caching 
to  achieve  the  needed  performance  in  the  presence  of  inherently  high  network 
delays. 

The  communication  architecture  of  DASH  is  designed  to  provide  the  basic  facilities 
needed  for  these  types  of  applications.  The  architecture  spans  multiple  levels,  from  the 
network  level  up  to  naming  and  service  access.  The  architecture  uses  the  abstraction  of  a 
Parameterized  Message  Channel  (or  simply  channel).  A  channel  is  a  simplex  message 
stream  with  several  performance,  reliability,  and  security  parameters.  The  interface  to 
the  network-dependent  communication  layer  is  based  on  channels,  and  the  channel 
abstraction  appears  at  higher  levels  of  the  DASH  system  as  well.  Channels  are  the  basis 
for  a  request/reply  communication  facility  called  the  Remote  Operation  Facility  (ROF). 

The report is organized as follows: Section 2 is an overview of the DASH communication
architecture. Section 3 presents the parameterized message channel abstraction.
Section  4  lists  the  local  kernel-level  object  types  underlying  the  network  communication 
abstractions,  and  Section  5  describes  the  mechanism  for  remote  references  to  these 
objects.  Sections  6  and  7  describe  the  network  and  subtransport  levels.  Sections  8 
through  10  describe  the  Remote  Operation  Facility,  the  Service  Access  Mechanism,  and 
the global naming system. Appendix I describes the DASH Message Language (DML),
used for specifying network messages.

2.  ARCHITECTURE  OVERVIEW 

We  begin  by  giving  an  overview  of  the  DASH  network  communication  architecture.  The 
structure  of  this  architecture  is  shown  in  Figure  2.1. 

2.1.  The  Network  Layer 

The  DASH  communication  architecture  can  be  implemented  on  multiple  networks.  Each 
network  to  which  a  DASH  host  is  connected  is  represented  in  its  kernel  by  a  software 
module with a prescribed channel-based interface. These network objects provide host-to-host
network channels. They encapsulate network-specific protocols for channel creation,
deletion, and transmission, and other tasks such as routing and network management.
The DASH network layer is the collection of these networks and the network
objects.


Figure 2.1: The DASH Communication Architecture. (Labels in the figure: network channel; ST channel; function calls or local IPC; physical media.)

2.2. The Subtransport Layer

The subtransport layer (ST) provides the basic form of inter-host process-level communication.
All higher-level network communication in DASH passes through ST, and ST is
the only direct client of the network layer. ST supports ST channels, which are “value-added”
versions of network channels. ST provides communication security, does fragmentation
and reassembly, multiplexes ST channels onto network channels, and arranges
for “fast acknowledgement” of messages sent on ST channels.

In  most  existing  protocol  architectures,  the  functions  of  ST  are  done  at  higher  levels  (e.g., 
security  is  provided  at  the  transport  level).  The  DASH  architecture  has  the  advantage 
that these functions are consolidated into a single per-host module. For example, we
argue  in  [2]  that  a  single  secure  channel  between  hosts  is  sufficient  for  authentication  and 
privacy,  rather  than  one  per  operation,  session,  or  network. 

2.3.  Transport  Protocols 

The Remote Operation Facility (ROF) supports request/reply communication. It uses a
set  of  ST  channels  called  the  ROF  connection.  Processes  can  also  directly  establish  ST 
channels  for  stream-mode  communication.  The  design  of  channel-based  stream  transport 
protocols  is  an  area  of  future  study. 

2.4.  Service  Access 

Certain  remotely-accessible  logical  resources  in  a  DASH  system  are  called  services. 
They  can  be  accessed  by  clients  through  the  service  access  mechanism  (SAM).  SAM 
provides  replication  transparency,  location  transparency,  and  failure  transparency.  It 
allows  clients  to  use  temporary  capabilities  (“service  tokens”)  to  reduce  name-resolution 
overhead. 

2.5.  Global  Naming 

A  goal  of  the  DASH  communication  architecture  is  that  resources  should  be  accessible 
from  any  host  in  the  system.  To  satisfy  this  goal,  a  global  naming  system  is  used  to  name 
long-lived entities such as hosts, services, and owners. The system uses a tree-structured
name  space.  Names  are  location  independent  in  that  1)  they  do  not  imply  the  location  of 
the  named  entity,  and  2)  they  are  the  same  regardless  of  the  location  or  identity  of  the 
entity  using  the  name. 

The  DASH  global  naming  system  and  SAM  are  integrated.  The  global  naming  system  is 
implemented  as  a  set  of  services,  and  SAM  uses  the  naming  system  for  service  location 
and  authentication. 

3.  PARAMETERIZED  MESSAGE  CHANNELS 

The DASH distributed system is intended to run on multiple types of computer architectures
and communication networks. A large part of the DASH network communication
system is network independent, and is based on a network-dependent part that has a
network-independent interface.

In  most  existing  distributed  systems,  the  interface  to  the  network  level  typically  provides 
a  simple  abstraction  such  as  unreliable,  insecure  datagrams.  Upper  layers  then  use  this 
facility to provide higher-level abstractions such as reliable request/reply message-passing
[7], reliable and secure typed message streams [20], or reliable byte streams [17].
Because the bottom-level abstraction (e.g., datagrams) is simple, it is easy to port the system
to different network types. However, this approach suffers from several basic problems,
stemming from the simplicity of the abstraction:

•  Communication  clients  cannot  express  their  performance,  reliability  and  security 
needs  to  the  communication  provider.  This  results  in  wasted  work.  For  example, 
data integrity is often a mandatory part of communication primitives, and is provided
by software checksumming. This work is wasted for applications that do not
require  data  integrity.  Conversely,  network  interfaces  may  do  data  checksumming 
in  hardware,  but  if  this  is  concealed  from  upper  layers,  then  checksumming  may  be 
redundantly  done  in  software. 

•  Simple  abstractions  do  not  allow  the  communication  provider  to  impose  static  limits 
on  client  behavior,  such  as  the  amount  of  client  data  outstanding  within  the  network. 
The  problem  of  congestion  must  then  be  attacked  by  methods  (such  as  ICMP’s 
source-quench  messages  [18])  that  are  often  ineffective. 

•  No  provisions  are  made  for  real-time  performance  guarantees.  Such  guarantees  are 
needed  for  interactive  high-bandwidth  traffic  such  as  digitized  audio  [4]  and  video. 

In  an  attempt  to  solve  these  problems,  the  DASH  network  communication  system  is 
based  on  an  abstraction  called  a  parameterized  message  channel,  or  simply  a  channel. 
This  decision  is  motivated  by  the  projections  summarized  in  Section  1.  Current  networks 
and  protocol  architectures  do  not  directly  support  channels.  However,  our  approach  is 
capable  of  exploiting  future  advances  in  communication  technology. 

A  channel  links  a  sender  to  a  receiver.  It  carries  messages,  which  are  untyped  byte 
arrays.  The  sender  invokes  send  operations  on  the  channel.  The  receiver  is  typically  a 
passive  object  such  as  a  port;  a  message  is  considered  delivered  when  it  is  queued  on  the 
port  or  given  to  a  process  waiting  at  the  port.  The  sender  and  receiver  are  channel 
clients.  The  hardware  and  software  system  supporting  the  creation  and  use  of  channels  is 
the channel provider. A channel client of one level may be a channel provider to a
higher  level. 

A channel can be deleted by either client. A channel fails when one of the clients’ hosts
crashes, or when a failure or resource scarcity in the network makes it impossible to continue
sending messages on the channel. When a failure occurs no further messages are
delivered,  and  the  surviving  channel  clients  are  notified  of  the  failure.  A  channel  is  said 
to  be  closed  when  either  it  fails  or  it  is  deleted. 

3.1.  Channel  Parameters 

A channel has the following security and performance parameters. Some parameters
represent guarantees by the channel provider; others are restrictions on the channel client.
The form and meaning of some of the channel parameters depend on the type of the
channel (see Section 3.2).

(1) Authentication: if true^1, then impersonation (delivery of a message not sent by the
sender) is impossible^2. This implies that the bit error rate (see below) is unaffected
by the presence of malicious or malfunctioning hosts.

^1 The authentication and privacy parameters are Boolean. They could instead be continuous, perhaps
representing the strength of the underlying encryption system.

(2) Privacy: if true, then eavesdropping (access to a message by a host or process other
than the receiver) is impossible.

(3)  Sequenced:  if  true,  messages  are  never  delivered  out  of  order. 

(4)  Capacity:  an  upper  bound  on  the  number  of  bytes  of  data  outstanding  (i.e.,  sent  but 
not  yet  delivered)  within  the  channel  at  any  time.  The  clients  are  responsible  for 
enforcing  the  channel  capacity.  If  they  fail  to  do  so,  the  provider’s  guarantees  are 
voided;  messages  may  be  delivered  beyond  the  channel  delay  bounds  or  discarded. 

(5)  Maximum  message  size:  an  upper  bound  on  the  size  of  individual  messages.  This 
limit  cannot  be  greater  than  the  channel  capacity. 

(6) Delay parameters^3: message delay is the elapsed real time between the start of the
send  operation  and  the  moment  of  delivery.  The  components  of  this  delay  may 
include  network  transmission  delay,  queueing  and  processing  delays  at  the  sender 
and  at  intermediate  switches,  and  processing  at  the  receiver.  Depending  on  the 
channel  type,  several  parameters  might  be  used.  For  example,  a  deterministic 
guarantee  might  use  two  parameters  A  and  B,  representing  a  delay  bound  of 

A  +  B* (message  size). 

(7)  Workload  parameters:  some  channel  types  require  parameters  (such  as  average 
load  and  burstiness)  describing  the  workload. 

(8)  Message  loss  rate  parameters:  a  set  of  parameters  describing  the  probability  that  a 
given  message  is  successfully  delivered.  Messages  may  fail  to  be  delivered  because 
of  1)  buffer  overrun  in  the  receiver  host  or  in  network  switches,  and  2)  discarding 
due  to  checksumming.  The  form  and  meaning  of  the  parameters  depends  on  the 
channel  type.  The  loss  rate  may  be  constant  or  load-dependent,  and  it  may  or  may 
not  depend  on  the  message  size.  The  loss  rate  does  not  take  into  account  malicious 
or  malfunctioning  hosts.  The  channel  abstraction  does  not  specify  freedom  from 
denial  of  service. 

(9) Average bit error rate: of the messages that are delivered, the fraction of bits that
are incorrect. This parameter reflects the combination of 1) the error rate of the
underlying  transmission  medium,  and  2)  the  effectiveness  of  the  checksumming 
algorithm.  It  is  guaranteed  by  the  channel  provider. 

(10) Failure reporting delay bound: this parameter is an upper bound on the interval
between when the channel fails (see below) and when the surviving clients are
notified of the failure.

^2 “Impossible” assuming an unbreakable encryption scheme.

^3 Alternatively, a channel could have a “guaranteed bandwidth” parameter. However, this parameter
is less convenient from the point of view of implementation, and it is determined by the current parameters
as follows. If M is the maximum message size, D is the maximum delay of a message of size M, and C
is the channel capacity, then a client can send a message of size M every DM/C seconds without violating
the capacity rule, since at any point at most C/M messages (of total size C) will have been sent within the
previous D seconds, and all earlier messages are guaranteed to have been delivered already. This will provide
a bandwidth of about C/D bytes per second. The actual maximum bandwidth may either be lower (because
of errors and protocol overhead) or higher (if actual delays are smaller than the upper bound).
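
As a rough illustration of the capacity, maximum message size, and delay parameters, the following C sketch evaluates a deterministic delay bound of the form A + B*(message size) and the approximate bandwidth C/D that follows from the capacity rule discussed in the guaranteed-bandwidth footnote. The names det_delay_t, delay_bound_us, and implied_bandwidth, and the use of microseconds as the time unit, are assumptions made for this example; they are not part of the DASH interface.

```c
#include <stdint.h>

/* Hypothetical deterministic delay parameters (not a DASH type):
 * delay bound = a_us + b_us_per_byte * (message size). */
typedef struct {
    uint64_t a_us;           /* fixed component A, in microseconds */
    uint64_t b_us_per_byte;  /* per-byte component B, in microseconds per byte */
} det_delay_t;

/* Upper bound, in microseconds, on the delay of a message of msg_size bytes. */
static uint64_t delay_bound_us(det_delay_t d, uint64_t msg_size)
{
    return d.a_us + d.b_us_per_byte * msg_size;
}

/* Approximate bandwidth in bytes per second, computed as C/D, where C is the
 * channel capacity in bytes and D is the delay bound for a maximum-size
 * message. Returns 0 if D is zero. */
static uint64_t implied_bandwidth(det_delay_t d, uint64_t capacity,
                                  uint64_t max_msg_size)
{
    uint64_t d_us = delay_bound_us(d, max_msg_size);
    if (d_us == 0)
        return 0;
    return capacity * 1000000ULL / d_us;
}
```

For instance, with A = 1000 microseconds, B = 2 microseconds per byte, a maximum message size of 500 bytes, and a capacity of 8000 bytes, the delay bound is 2000 microseconds and the implied bandwidth is about 4,000,000 bytes per second.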

A  set  of  channel  parameters  is  described  by  the  following  structure: 

enum CHANNEL_TYPE {
    DETERMINISTIC,
    STATISTICAL,
    BEST_EFFORT
};

struct CHANNEL_PARAMS {
    CHANNEL_TYPE      type;
    BOOLEAN           authenticated;
    BOOLEAN           private;
    BOOLEAN           sequenced;
    U32               capacity;
    U32               max_msg_size;
    DELAY_PARAMS      delay_params;
    WORKLOAD_PARAMS   workload_params;
    LOSS_RATE_PARAMS  loss_rate_params;
    U32               bit_error_rate;
    U32               failure_delay_bound;
};

The meaning and form of delay_params, workload_params, and
loss_rate_params are type dependent (see below). The channel client is responsible
for obeying the channel capacity and workload parameters; the channel provider is not
responsible for detecting violations. In the event of a violation, the channel delay and
message-loss parameters are voided, but the other parameters remain valid.

3.2. Channel Types

As indicated above, there are different channel types. A channel type consists of

• Parameterizations of the delay distribution, the workload, and the message loss rate.

• A partial order < on the space of possible parameter values. For parameter values X
and Y, X < Y iff Y “dominates” X, i.e., Y is acceptable to a client whenever X is.

The following are examples of possible channel types:

Deterministic: the delay bounds are “hard”; only a channel failure will cause them to be
violated. System resources (buffer space, media bandwidth) are allocated to individual
channels. The channel provider rejects a channel request if its worst-case demands cannot
be met with free resources.

Statistical: the delay bounds are statistical, perhaps involving the mean and variance of
the delay^4. The workload parameters include average load and burstiness. A channel creation
request is rejected if either its expected message delay or its expected bit error rate
(which is affected by the possibility of buffer overruns) is unacceptably high.

Best-Effort: channel creation requests are never rejected, and there are no workload
parameters. Delay bound parameters are used only to schedule resources based on message
delivery deadlines (see Section 3.5).

^4 Appropriate parameterizations of the delay distribution and of the workload are currently being
investigated.

3.3. Channel Creation and Ownership

The creator of a channel (which may be either the sender or the receiver) owns the channel.
It is “billed” for the channel if such a notion exists. Either side may delete the
channel.

A set A of actual channel parameters is said to be compatible with a set R of requested
parameters if

(1) the Boolean security parameters of A include those of R;

(2) the capacity and maximum message size parameters of A are no less than those of R,
and 

(3)  the  performance  parameters  of  A  dominate  (in  the  sense  defined  by  the  channel 
type)  the  parameters  of  R. 

A channel creation request includes desired and acceptable parameter sets^5. A channel
creation succeeds only if the actual parameters of the resulting channel are compatible
with the request’s acceptable parameters. Furthermore, the channel provider tries to
match the desired parameters as closely as possible.
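
The compatibility rules can be sketched in C as follows. This is only an illustration under stated assumptions: the structure sketch_params_t is a simplified stand-in for CHANNEL_PARAMS, and the dominance test shown (each deterministic delay parameter of the actual channel is no larger than the requested one) is one plausible ordering for a deterministic channel type, not a definition taken from the DASH design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for CHANNEL_PARAMS. */
typedef struct {
    bool     authenticated;
    bool     is_private;
    uint32_t capacity;      /* bytes */
    uint32_t max_msg_size;  /* bytes */
    uint32_t delay_a;       /* assumed deterministic delay parameter A */
    uint32_t delay_b;       /* assumed deterministic delay parameter B */
} sketch_params_t;

/* A Boolean guarantee "includes" a request if it is provided whenever asked for. */
static bool includes(bool actual, bool requested)
{
    return actual || !requested;
}

/* Assumed dominance order: smaller delay bounds dominate larger ones. */
static bool dominates(const sketch_params_t *a, const sketch_params_t *r)
{
    return a->delay_a <= r->delay_a && a->delay_b <= r->delay_b;
}

/* Rules (1) to (3): security flags, capacity and message size, performance. */
static bool is_compatible(const sketch_params_t *a, const sketch_params_t *r)
{
    return includes(a->authenticated, r->authenticated)
        && includes(a->is_private, r->is_private)
        && a->capacity >= r->capacity
        && a->max_msg_size >= r->max_msg_size
        && dominates(a, r);
}
```

A provider following this sketch would apply is_compatible to the actual parameters and the request's acceptable set before reporting that channel creation succeeded.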

3.4.  Parameterized  Message  Channel  Examples 

As  an  example  of  the  use  of  channel  parameters,  consider  the  case  of  a  DASH  client  (say 
a  transport  protocol  serving  a  user  program)  that  requires  data  privacy.  The  protocol 
requests  an  ST  channel  (see  Section  3.3).  The  desired  and  acceptable  parameter  sets  both 
have the private flag set. ST creates the new ST channel, routing it through a network
channel. Depending on the parameters of the network channel, several different
situations  are  possible: 

(1)  Privacy  is  provided  by  ST-level  data  encryption. 

(2)  If  the  network  has  link-level  encryption  hardware,  ST  learns  this  from  the  network 
channel  parameters,  and  does  no  data  encryption. 

(3)  The  network  is  considered  secure,  so  no  data  encryption  need  be  done  at  any  level. 

In  any  case,  the  optimal  mechanism  is  used  for  privacy.  If  a  client  does  not  require 
privacy,  no  mechanism  is  used  (which  is  again  optimal).  Without  the  channel  security 
parameters,  this  optimization  would  not  be  possible.  A  similar  situation  exists  for  data 
integrity:  the  optimal  checksumming  mechanism  can  be  used  based  on  the  values  of  the 
relevant  channel  parameters. 
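
The privacy cases above amount to a small decision rule, sketched here in C. The enumeration st_privacy_t and the function choose_privacy are hypothetical names introduced for this example. The sketch assumes that a network channel reports privacy as a single Boolean, which is true both when link-level encryption hardware is present and when the network is considered secure.

```c
#include <stdbool.h>

/* Hypothetical result type: what ST itself must do to provide privacy. */
typedef enum {
    PRIV_NONE,          /* ST does no encryption */
    PRIV_ST_ENCRYPTION  /* ST encrypts data itself */
} st_privacy_t;

/* client_wants_privacy: the private flag requested for the ST channel.
 * net_private: the privacy parameter of the underlying network channel. */
static st_privacy_t choose_privacy(bool client_wants_privacy, bool net_private)
{
    if (!client_wants_privacy)
        return PRIV_NONE;          /* no mechanism is needed */
    if (net_private)
        return PRIV_NONE;          /* cases (2) and (3): the network suffices */
    return PRIV_ST_ENCRYPTION;     /* case (1): ST-level data encryption */
}
```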

The following examples show the uses of the channel capacity and performance parameters:

• Initial request and reply messages in a request/reply protocol [5] should be sent on
channels with a low delay bound. The precise delay bound and the delay bound
type are determined by application needs. The channel capacity required may be
large, unless it is known that request or reply messages will be small and infrequent.

• Request/reply retransmissions can be sent on channels with higher delay bounds.

• A stream protocol for bulk data transfer should use a high capacity, high delay
channel for data [10].

• Reliability acknowledgements (for both request/reply and stream communication)
should use low capacity, high delay channels.

• Flow control acknowledgements should use a low delay, low capacity channel.

• Digitized voice should use a high capacity, low delay channel, perhaps with a statistical
delay bound [22]. A high bit error rate may be acceptable.

• Human user interface traffic (such as for network window systems [11]) can tolerate
a moderate amount of delay because of human perceptual limitations. The channel
from user to application carries mouse and keyboard events, and can have low
capacity. The channel in the opposite direction carries graphic information, and
generally requires higher capacity.

^5 The current design allows only one “acceptable” parameter set. This could be extended to allow
multiple sets.

In  all  these  cases,  the  provider’s  knowledge  of  client  needs  increases  the  likelihood  that 
they  can  be  accommodated.  For  example,  if  packet  queueing  in  an  internetwork  gateway 
is  done  using  channel-specified  deadlines,  then  a  low-delay  packet  can  be  sent  before 
high-delay  packets  that  would  otherwise  cause  it  to  be  delivered  late.  A  network  may  be 
capable  of  providing  low  delay  or  high  capacity,  but  not  both. 

3.5.  Process  and  Interface  Scheduling 

When  an  upper-level  channel  is  created,  its  total  delay  is  divided  among  its  various 
stages  (send  protocol  processing,  ST  channel  delay,  network  channel  delay,  and  receive 
protocol  processing).  When  a  message  is  sent  on  a  channel,  there  is  a  deadline  by  which 
it  must  be  handled  (i.e.,  processed  by  a  protocol,  sent  on  a  lower-level  channel,  or 
transmitted  on  a  network  medium).  This  deadline  is  the  current  real  time  plus  the  delay 
allocated  to  the  next  stage  of  the  channel. 

For  channels  whose  delay  includes  processing  time,  these  deadlines  are  used  by  the 
kernel’s  process  scheduler  to  determine  the  execution  order  of  protocol  or  user  processes. 
For  network  channels,  the  deadlines  are  used  to  determine  the  order  in  which  packets  are 
queued  for  transmission  on  a  network  interface. 
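
A minimal C sketch of deadline assignment and deadline-based ordering is given below. The text states only that deadlines determine the order; the earliest-deadline-first policy used here, and the names queued_msg_t, stage_deadline, and order_for_transmission, are assumptions made for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queue entry: a message awaiting its next stage. */
typedef struct {
    uint64_t deadline_us;  /* absolute deadline, in microseconds */
    int      id;           /* message identifier, for illustration only */
} queued_msg_t;

/* Deadline = current real time + delay allocated to the next stage. */
static uint64_t stage_deadline(uint64_t now_us, uint64_t stage_delay_us)
{
    return now_us + stage_delay_us;
}

/* qsort comparator: earlier deadlines sort first. */
static int by_deadline(const void *x, const void *y)
{
    const queued_msg_t *a = x;
    const queued_msg_t *b = y;
    return (a->deadline_us > b->deadline_us) - (a->deadline_us < b->deadline_us);
}

/* Orders queued messages so that the earliest deadline is handled first. */
static void order_for_transmission(queued_msg_t *q, size_t n)
{
    qsort(q, n, sizeof q[0], by_deadline);
}
```

The same ordering could serve either the process scheduler or a network interface queue; a real implementation would more likely keep a priority queue than re-sort an array.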

3.6.  Issues  in  the  Subtransport  Layer 

This  section  discusses  issues  that  arise  in  building  high-level  channels  on  top  of  low-level 
channels.  This  is  most  directly  relevant  to  the  DASH  subtransport  layer. 

3.6.1.  Channel  Caching  and  Multiplexing 

ST  caches  network  channels;  i.e.,  it  may  retain  a  network  channel  even  while  it  is  not 
being  used  by  any  ST  channel.  This  caching  is  motivated  by  two  assumptions:  1)  during 
a  given  time  period  a  host  will  tend  to  communicate  repeatedly  with  a  small  set  of  remote 
hosts;  2)  it  is  slow  and  costly  to  create  network  channels. 

ST does upwards multiplexing^6 of multiple ST channels onto a single network channel
(see Figure 3.2). This multiplexing can reduce overhead and delay compared to a
policy of creating a network channel for each ST channel.

^6 It would also be possible to downwards-multiplex an ST channel across several network channels.
If there were multiple network paths between the hosts, this technique could be used to increase capacity
beyond that available in a single network channel. However, this has not been included in the DASH
design because the expected gain may not outweigh the additional ST protocol complexity.

Figure 3.2: Channel Multiplexing. (Labels in the figure: multiplexing; ST channel.)

The type and parameters of an ST channel may be different from those of the network
channel on which it is multiplexed. In some cases ST can offer “better” parameters than
those of the network channel. However, some parameters cannot be improved; multiplexing
is subject to the following restrictions:

• A deterministic ST channel can only be multiplexed onto a deterministic network
channel, and a statistical ST channel can be multiplexed only onto a deterministic or
statistical network channel.

• The delay bound parameters of an ST channel must be at least those of its network
channel.

• The capacity of a network channel must be at least the sum of the capacities of its
ST channels.

Algorithms for multiplexing decisions, in particular those involving statistical types and
mixed types, remain to be studied.

3.6.2. Increased Maximum Message Size

At the network level, a message size limit is imposed by hardware restrictions, channel
capacity, nonzero bit error rate, or the need for bounded delay and fairness. For example,
the Ethernet has a 1.5KB packet size limit [14]; future networks may have a limit of
64KB or so. It is an issue whether ST should offer larger maximum message sizes to its
clients than those provided by the network layer. The following choices exist:

•  The  maximum  message  size  of  an  ST  channel  is  that  of  its  network  channel. 

• The ST channel has a much larger maximum message size than its network channel
(e.g., many megabytes). ST must incorporate a retransmission-based reliability
mechanism; otherwise the message loss rate would increase exponentially with the
ST message size. Because messages can exceed the capacity of the network channel,
a flow-control mechanism also must be used. This general approach is taken in VMTP [8].

• The maximum message size of the ST channel may be larger than that of its network
channel, but no greater than the capacity of the network channel, and not so
large  that  a  reliability  mechanism  is  needed.  This  might  be  one  or  two  orders  of 
magnitude  more  than  the  network  channel  maximum  message  size  (e.g.,  64KB  on 
an  Ethernet). 

ST  uses  the  third  option.  This  choice  has  the  advantages  of  reducing  the  number  of 
high-level  messages  (and  hence  protocol  processing  and  scheduling  overhead)  without 
significantly adding to the complexity of ST, or requiring that it assume the role of a transport
protocol.

ST  chooses  a  maximum  message  size  with  the  goal  of  maximizing  potential  throughput 
based  on  the  combination  of  network  channel  error  rate  bound,  ST  channel  error  rate 
bound,  and  context  switch  time.  ST  does  fragmentation  and  reassembly  to  support  this 
larger  message  size.  It  does  not  retransmit  fragments;  if  a  message  is  incomplete  when  a 
fragment  of  the  next  message  arrives,  the  partial  message  is  discarded. 
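
The fragmentation arithmetic and the discard rule can be sketched in C as follows. The sketch assumes that each fragment carries a message identifier and the total fragment count; this header layout, and the names fragment_count, reassembly_t, and on_fragment, are hypothetical. Payload handling, duplicate fragments, and fragment ordering are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of network-level fragments needed for a message of msg_size bytes
 * when the network channel carries at most net_max bytes per message. */
static uint32_t fragment_count(uint32_t msg_size, uint32_t net_max)
{
    if (net_max == 0)
        return 0;
    return msg_size / net_max + (msg_size % net_max != 0);
}

/* Hypothetical per-channel reassembly state. */
typedef struct {
    uint32_t msg_id;    /* message currently being reassembled */
    uint32_t expected;  /* total fragments in that message */
    uint32_t received;  /* fragments seen so far */
    bool     active;    /* true while a partial message is held */
} reassembly_t;

/* Processes one arriving fragment. Returns true when the fragment completes
 * a message. If the fragment belongs to a different message than the one in
 * progress, the partial message is discarded (no retransmission is requested). */
static bool on_fragment(reassembly_t *r, uint32_t msg_id, uint32_t total)
{
    if (!r->active || r->msg_id != msg_id) {
        r->msg_id = msg_id;
        r->expected = total;
        r->received = 0;
        r->active = true;
    }
    r->received++;
    if (r->received == r->expected) {
        r->active = false;
        return true;
    }
    return false;
}
```

With a 1500-byte network limit, a 65536-byte ST message would be split into 44 fragments under this sketch.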

3.6.3.  Failure  Detection  and  Handling 

If  an  ST  channel  has  a  smaller  failure  reporting  delay  bound  than  its  network  channel,  ST 
must use “pinging” messages on the network channel to learn of host, process or network
failures within the specified bound.

The  handling  of  a  network  channel  failure  may  vary.  If  the  ST  channel  is  best-effort,  it 
may be possible to establish a new network channel (or switch to another existing network
channel) and continue the ST channel service without interruption. For other ST
channel types, the delay in establishing a new network channel might make this impossible.

3.6.4.  Security  and  Reliability 

If the ST channel has stricter security or reliability requirements than the network channel,
techniques such as encryption and checksumming may be used to bridge the gap.
These  are  discussed  in  Section  7.2.1. 

If the ST channel is sequenced and the network channel is not, ST can include sequence
numbers  with  messages  and,  depending  on  the  other  channel  parameters,  either  reorder 
messages  before  delivery  or  simply  discard  messages  that  arrive  out  of  order. 
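
The simpler of the two policies, discarding messages that arrive out of order, might look like the following C sketch. The names seq_state_t and accept_in_order are hypothetical, and sequence-number wraparound is ignored for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical receiver state for one sequenced ST channel. */
typedef struct {
    uint32_t next_seq;  /* lowest sequence number still acceptable */
} seq_state_t;

/* Returns true if the message may be delivered. A message older than one
 * already delivered is discarded; gaps left by lost messages are tolerated,
 * so delivery order is preserved without buffering or reordering. */
static bool accept_in_order(seq_state_t *s, uint32_t seq)
{
    if (seq < s->next_seq)
        return false;
    s->next_seq = seq + 1;
    return true;
}
```

The reordering policy would instead buffer early arrivals until the gap is filled, which trades additional delay and buffer space for a lower loss rate.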

3.7. Flow Control and Channel Capacity Enforcement

At points where there can be speed mismatches, a communication system may do buffering
to avoid message loss. Flow control can be used to avoid performance loss due to
buffer overrun and dropped packets. Flow control mechanisms are often necessary even
for minimal performance levels. On the other hand, flow control mechanisms may not be
needed in certain situations (for example, if the speed of the sender is sufficiently low)
and in that case may impose an unnecessary overhead.

The  endpoints  of  communication  are  called  the  data  source  and  sink.  The  source  of  data 
may  be  an  I/O  device  (such  as  a  disk,  a  main  memory  cache,  or  a  real-time  audio/video 
digitizer)  or  a  computation  (e.g.,  a  process  generating  text  or  graphic  images).  It  is  useful 
to  factor  communication  system  buffers  into  three  groups  (see  Figure  3.3): 

(1)  Buffers  between  the  data  source  and  the  send  side  of  the  transport  protocol. 

(2)  Buffers  in  the  sender’s  kernel,  the  network  switches  and  gateways,  and  in  the 
receiver’s  network  interface  and  low-level  driver. 

(3)  Buffers  between  the  receive  side  of  the  transport  protocol  and  the  data  sink. 

In a system based on parameterized message channels, these buffer groups can be treated
separately. Cases in which no flow control is needed for one or more of the buffer groups
can be identified, and a better transport protocol (simpler, faster or fewer messages) may
be possible. In contrast, existing systems generally require end-to-end flow control [19].


[Figure 3.3: Types of Flow Control. Diagram: between the data source and the data sink,
arrows mark RMS capacity enforcement, sender flow control, receiver flow control, and
combined RMS capacity enforcement + receiver flow control.]

3.7.1. Channel Capacity Enforcement

The  capacity  parameter  of  a  channel  prevents  overrunning  buffers  in  group  2.  In  con¬ 
trast,  the  flow  control  of  TCP  does  not  protect  gateway  buffers;  ICMP  source  quench 
messages  [15, 18]  provide  an  ad  hoc  and  often  ineffective  solution  to  this  flow  control 
problem. 

The  channel  approach  assumes  that  these  buffers  do  not  shrink  spontaneously.  If  they 
do,  it  may  be  necessary  for  the  channel  provider  to  delete  the  channel,  and  for  the  clients 
to  create  a  new  channel. 

As  described  in  Section  3.1,  capacity  enforcement  is  the  responsibility  of  the  channel 
client;  there  are  no  capacity-enforcement  mechanisms  in  the  DASH  ST  or  network 
layers. Depending on the channel parameters and the source and sink speeds, capacity
enforcement may not be needed. If it is needed, the following approaches are possible:

• Rate-based: using timers, the sender ensures that during any time period of duration
A + CB, the number of bytes sent does not exceed C. This approach is pessimistic
in the sense that it assumes the maximum delay for all messages.

•  Acknowledgement-based:  the  sender  receives  flow  control  acknowledgements  for 
messages  received.  This  may  achieve  higher  maximum  throughput  at  the  cost  of 
the  reverse  message  traffic. 

These  mechanisms,  if  needed,  can  be  implemented  by  transport  protocols.  The  same 
mechanisms  may  possibly  be  used  to  provide  receiver  flow  control  (see  below). 
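As a rough illustration of the rate-based approach, the sketch below keeps a sliding record of recent sends and refuses a send that would put more than C bytes into any period of the given duration. It is a minimal sketch under stated assumptions, not the DASH mechanism: the class name RateEnforcer is hypothetical, the window duration (A + CB in the text) is passed in precomputed, time is an integer tick count supplied by the caller, and the caller's clock is assumed never to go backwards.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical rate-based capacity enforcer for one channel.
class RateEnforcer {
public:
    // capacity_bytes is C; window is the period duration (A + CB in the text).
    RateEnforcer(std::uint64_t capacity_bytes, std::uint64_t window)
        : capacity_(capacity_bytes), window_(window) {}

    // Returns true and records the send if `bytes` may be sent at time `now`
    // without exceeding the capacity in the window (now - window, now].
    bool try_send(std::uint64_t now, std::uint64_t bytes) {
        // Forget sends that have fallen out of the window.
        while (!sent_.empty() && sent_.front().first + window_ <= now) {
            in_window_ -= sent_.front().second;
            sent_.pop_front();
        }
        if (in_window_ + bytes > capacity_) return false;
        sent_.emplace_back(now, bytes);
        in_window_ += bytes;
        return true;
    }

private:
    std::uint64_t capacity_;
    std::uint64_t window_;
    std::uint64_t in_window_ = 0;
    std::deque<std::pair<std::uint64_t, std::uint64_t>> sent_;  // (time, bytes)
};
```

Checking only the window that ends at the current send is sufficient here: any period of that duration ends no later than the window checked at its last send, so it cannot contain more than C bytes. A sender whose try_send() is refused would set a timer and retry once the oldest recorded send has aged out.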

3.7.2.  Receiver  Flow  Control 

If the receive transport protocol and data sink can process incoming data at a faster
average rate than its arrival, it is possible that no flow control for buffer group 3 is needed.
If this condition does not hold, the protocol must stop sending data when the limit of the
receive buffer is reached. A receiver flow control mechanism such as a sliding window
protocol [17] can be used. The need for this flow control mechanism is independent of
the need for channel capacity enforcement; if both are needed they could be integrated
into a single protocol.
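The sender-side bookkeeping of a window-based receiver flow control scheme can be sketched as below. This is a simplified illustration rather than a full sliding window protocol as in [17]: the class name WindowSender and its methods are hypothetical, and acknowledgements are assumed to report how many bytes the data sink has consumed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sender-side window: never more unacknowledged bytes in
// flight than the receive buffer can hold.
class WindowSender {
public:
    explicit WindowSender(std::size_t receive_buffer_bytes)
        : limit_(receive_buffer_bytes) {}

    // Try to send `bytes`; refused if it could overflow the receive buffer.
    bool send(std::size_t bytes) {
        if (bytes > limit_ - outstanding_) return false;
        outstanding_ += bytes;
        return true;
    }

    // Called when the receiver reports that the sink consumed `bytes`.
    void ack(std::size_t bytes) {
        outstanding_ -= (bytes < outstanding_) ? bytes : outstanding_;
    }

    std::size_t outstanding() const { return outstanding_; }

private:
    std::size_t limit_;
    std::size_t outstanding_ = 0;  // invariant: outstanding_ <= limit_
};
```

If capacity enforcement is also needed, the same acknowledgements could drive both checks, which is one way the two mechanisms might be integrated into a single protocol.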

3.7.3. Sender Flow Control

If the data source can produce data faster than it can be sent on the ST channel (because
of capacity enforcement, receiver flow control, or both) then there must be a sender flow
control mechanism to protect buffer group 1. This is done in the DASH kernel using a
flow controlled local IPC port for message-passing between the sender and the send
protocol [24]. A sender blocks when a port queue size limit is reached. The sending
transport protocol stops reading messages from the port while it is prevented from sending
because of channel capacity enforcement or receiver flow control.
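The interaction between the sender, the port, and the send protocol can be sketched as below. This is not the DASH IPC port of [24]: FlowControlledPort and its methods are hypothetical names, and blocking is modelled by a false return value so that the sketch stays single-threaded.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical flow-controlled local port between a sender and the send
// side of a transport protocol.
class FlowControlledPort {
public:
    explicit FlowControlledPort(std::size_t queue_limit) : limit_(queue_limit) {}

    // Sender side. Returns false when the queue size limit is reached;
    // a real sender would block here until space becomes available.
    bool try_send(const std::string& msg) {
        if (queue_.size() >= limit_) return false;
        queue_.push_back(msg);
        return true;
    }

    // Protocol side. The protocol reads only while it is allowed to
    // transmit, i.e. while it is not held back by capacity enforcement
    // or receiver flow control.
    bool try_receive(bool may_transmit, std::string* out) {
        if (!may_transmit || queue_.empty()) return false;
        *out = queue_.front();
        queue_.pop_front();
        return true;
    }

    std::size_t size() const { return queue_.size(); }

private:
    std::size_t limit_;
    std::deque<std::string> queue_;
};
```

Because the protocol stops reading when it cannot transmit, the port fills and back-pressure reaches the data source without any extra signalling.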

4.  KERNEL  STRUCTURE  FOR  NETWORK  COMMUNICATION 

Several abstractions (e.g. messages and message-passing objects) are used throughout the
DASH network communication architecture. The objects which represent these
abstractions are local to a host, and might be implemented in different ways on different
hosts. In defining the network architecture, it is convenient to use language-level
specifications in which these abstractions are represented by specific procedural
interfaces.

* Footnote to Section 3.7.2: A large receive buffer may be needed; the size is determined
by several factors, including the variability of the receiver’s speed.

In this section, we briefly describe the procedural interfaces that are used in the prototype
DASH kernel to represent these abstractions. The DASH kernel is being written in C++,
an object oriented language [21]. Therefore the interfaces are presented as class
definitions. However, the architecture does not rely on an object oriented
implementation, nor on the precise details of the interfaces.

4.1.  Messages  and  Message-Passing  Objects 

Local  communication  is  assumed  to  follow  a  message-passing  paradigm  (the  DASH  local 
message-passing  system  is  described  in  [24]). 

•  A  MESSAGE  object  represents  data  sent  or  received,  either  locally  or  remotely. 
The  abstraction  is  an  array  of  bytes,  together  with  a  header  which  includes  an 
integer  type  field  and  a  flags  bitmask  (for  MP  options). 

• A STREAM_MPO object is a message-passing object (MPO) with unidirectional
asynchronous semantics. Its basic operations are send() and receive().

• A REQ_REPLY_MPO object is a message-passing object for synchronous
(request/reply) operations. Its operations are request_reply() for the client,
and get_request() and send_reply() for the server.

4.2.  NAMED_ENTITY  Objects 

Entities with global names (see Section 10) are represented by objects of the base class
NAMED_ENTITY. The following classes are derived from NAMED_ENTITY:

•  An  OWNER  object  represents  an  owner  in  the  DASH  global  name  space  (see  Sec¬ 
tion  10).  Its  data  include  the  owner’s  public  key  (see  Section  10)  and,  in  some 
cases,  the  owner’s  private  key. 

• A HOST object represents a host, i.e., an endpoint of physical network
communication. Its data include a list of the host’s network addresses, and the host’s public key.

Other  classes  derived  from  NAMED_ENTITY  will  be  introduced  as  needed. 

5.  REMOTE  REFERENCES 

The  remote  reference  mechanism  allows  DASH  hosts  to  refer  to  temporary  entities  (e.g., 
message-passing  objects)  in  other  hosts.  A  remote  reference  is  created  by  a  host  to  refer 
to  an  object  within  itself,  and  is  issued  to  a  holder,  which  may  be  either 

•  a  particular  host, 

•  all  hosts  on  a  particular  network  to  which  the  host  is  connected,  or 

• all other hosts.

A  remote  reference  is  a  fixed-size  (64-bit)  datum;  the  issuing  host  determines  the  internal 
structure  of  the  datum.  Between  crashes,  a  host  must  never  issue  the  same  remote  refer¬ 
ence  twice  (i.e.,  referring  to  two  different  objects)  to  a  given  holder. 




The  following  remote  reference  values  are  reserved: 

• 0 (NULL_REF) is the equivalent of a NULL pointer, i.e. no object is referenced.

•  1  (ROF_NOTIFY_MPO)  refers  to  an  MPO  used  to  notify  the  ROF  module  of  the 
creation  or  failure  of  ST  channels. 

• 2 (ST_NOTIFY_MPO) refers to an MPO used to notify the ST layer of the creation
or failure of network channels. In the DASH kernel, the only client of the network
layer is ST and ST_NOTIFY_MPO is not needed. However, by having the
notification MPO specified, other network layer implementations, after bootstrapping
the connection using this well-known reference, may support multiple notification
ports and clients.

5.1.  Implementation  of  Remote  References 

A host must maintain the set of remote references it has issued, the holders to which they
were issued, and the type of the object to which each remote reference refers. The
mechanism to authenticate the source of a message provided by ST can be used to ensure
that remote references are used only by those who have permission to do so. Hosts that
allow objects to be deleted must also have a mechanism for detecting “dangling” remote
references. This can be done using unique IDs (see below).

In the DASH kernel, a remote reference is represented as a pair <unique object identifier
(UID), session identifier> of 32-bit numbers. The UID is a sequence number assigned
when a remotely-referenced object is created, and stored with the object. The session
identifier is used to detect stale remote references.

Each entity to whom remote references are issued manages those entries using a
REMOTE_REF_MGR object. The REMOTE_REF_MGR class provides the following
interface:

struct REMOTE_REF {
    U32  object_uid;
    U32  session_id;
};

enum OBJECT_TYPE {
    STREAM_MPO_TYPE,
    OUT_NET_CHANNEL_TYPE,
    IN_NET_CHANNEL_TYPE,
    OUT_ST_CHANNEL_TYPE,
    IN_ST_CHANNEL_TYPE,
    HOST_TYPE,
    OWNER_TYPE,
    NO_TYPE
};

BOOLEAN                                // TRUE implies a new entry
REMOTE_REF_MGR::add_entry(
    SHARED_OBJECT*  object_pointer,    // pointer to object
    OBJECT_TYPE     object_type,       // object type
    PTR_OR_INT      data,              // data associated with entry if new
    PTR_OR_INT*     existing_data,     // returned if not a new entry
    REMOTE_REF*     remote_ref         // returned
);

SHARED_OBJECT*                         // NULL if the entry does not exist
REMOTE_REF_MGR::get_entry(
    REMOTE_REF*     remote_ref,
    OBJECT_TYPE     object_type,
    PTR_OR_INT*     existing_data      // returned
);

Duplicate  remote  references  in  the  table  are  not  allowed;  if  an  entry  for  the  object  already 
exists,  it  is  returned.  There  are  also  functions  to  delete  entries  and  change  the  data  asso¬ 
ciated  with  an  entry. 
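The behaviour of add_entry() and get_entry() can be illustrated with a small table keyed by UID. This is a sketch under several assumptions, not the DASH source: PTR_OR_INT is modelled as an integer wide enough for a pointer, SHARED_OBJECT is reduced to a struct holding its UID, the session identifier is assumed to be fixed per manager (for example, chosen at boot), and delete_entry() is a hypothetical name for the deletion function the text mentions.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using U32 = std::uint32_t;
using PTR_OR_INT = std::intptr_t;        // assumption: pointer-sized integer

struct REMOTE_REF { U32 object_uid; U32 session_id; };

enum OBJECT_TYPE {
    STREAM_MPO_TYPE, OUT_NET_CHANNEL_TYPE, IN_NET_CHANNEL_TYPE,
    OUT_ST_CHANNEL_TYPE, IN_ST_CHANNEL_TYPE, HOST_TYPE, OWNER_TYPE, NO_TYPE
};

struct SHARED_OBJECT { U32 uid; };       // assumption: the UID is stored here

class REMOTE_REF_MGR {
public:
    explicit REMOTE_REF_MGR(U32 session_id) : session_id_(session_id) {}

    // Returns true if a new entry was made. If the object already has an
    // entry, its data is returned in *existing_data instead.
    bool add_entry(SHARED_OBJECT* object, OBJECT_TYPE type, PTR_OR_INT data,
                   PTR_OR_INT* existing_data, REMOTE_REF* remote_ref) {
        remote_ref->object_uid = object->uid;
        remote_ref->session_id = session_id_;
        auto it = table_.find(object->uid);
        if (it != table_.end()) {
            *existing_data = it->second.data;
            return false;
        }
        table_.emplace(object->uid, Entry{object, type, data});
        return true;
    }

    // Returns nullptr for a stale session, an unknown UID, or a type mismatch.
    SHARED_OBJECT* get_entry(const REMOTE_REF* ref, OBJECT_TYPE type,
                             PTR_OR_INT* existing_data) {
        if (ref->session_id != session_id_) return nullptr;
        auto it = table_.find(ref->object_uid);
        if (it == table_.end() || it->second.type != type) return nullptr;
        *existing_data = it->second.data;
        return it->second.object;
    }

    void delete_entry(const REMOTE_REF* ref) {
        if (ref->session_id == session_id_) table_.erase(ref->object_uid);
    }

private:
    struct Entry { SHARED_OBJECT* object; OBJECT_TYPE type; PTR_OR_INT data; };
    U32 session_id_;
    std::unordered_map<U32, Entry> table_;
};
```

Checking the requested type on lookup means a reference issued for one kind of object cannot be used to reach an object of another kind, and a reference from an earlier session simply fails to resolve.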


6.  THE  NETWORK  LAYER 

In DASH terminology, a network is an abstract entity that connects a set of hosts and
provides parameterized message channel service between pairs of these hosts. Different
networks need not be physically or logically disjoint. For example, the DARPA Internet
and a local Ethernet (with the addition of support for channels) could be two separate
DASH networks, although they might share host interfaces and network media.

Each  DASH  host  is  connected  to  one  or  more  networks,  and  must  implement  the  channel 
protocols  of  those  networks.  A  host  has  one  or  more  network  addresses  in  each  network 
to  which  it  is  connected.  In  the  current  design,  there  is  no  mechanism  to  connect  chan¬ 
nels  of  different  networks.  A  pair  of  DASH  hosts  can  communicate  directly  only  if  they 
belong  to  a  common  network. 

In  the  DASH  kernel,  the  network  layer  consists  of  a  set  of  objects  derived  from  the  base 
class  NETWORK.  Each  of  these  objects  corresponds  to  a  network  to  which  the  host  is 
connected.  A  NETWORK  object  has  the  following  Boolean  attributes: 

All  hosts  trusted:  true  if  this  kernel  trusts  the  kernels  of  all  the  hosts  on  the  net¬ 
work.  This  does  not  imply  that  the  network  is  physically  secure. 

Physical broadcast network: true if the network has the property that any message
completely received by any node is received by all nodes. LANs such as Ethernet
[14] have this property.

Network objects should support only those channel security and reliability properties for
which they have hardware support (such as by hardware checksumming or encryption in the
network interface) or that hold by virtue of trust properties (e.g., that the network is
physically secure). The gap between these properties and those required by clients is bridged
by ST.


6.1.  Network  Channel  Endpoint  Objects 

A network channel is represented, in each of the kernels it connects, by an endpoint
object. An endpoint is active on the host that created the channel, and passive on the
other end. It is outgoing or incoming depending on the data direction relative to an
endpoint. There are two network endpoint base classes, OUT_NET_CHANNEL and
IN_NET_CHANNEL. For each type of network, there are network channel endpoint
classes derived from these base classes. The classes are derived from a base class
containing a type field that is used to distinguish between the endpoint types in operations
such as NETWORK::delete() that take generic pointers.

6.2.  Network  Channel  Creation:  Active  End 

The NETWORK class provides the following virtual function to create a set of network
channels:

NET_CHANNEL_CREATE_STATUS
NETWORK::create_channel(
    HOST*                  destination,      // destination host
    NETWORK_ADDRESS        network_address,  // network address to use
    U32                    to_peer,          // opaque data delivered to peer
    U32*                   from_peer,        // opaque data returned from peer
    int                    desc_count,       // number of channels requested
    NET_CHANNEL_REQUEST[]  descs             // list of request descriptors
);

struct NET_CHANNEL_REQUEST {
    CHANNEL_DIRECTION  direction;     // CHANNEL_IN or CHANNEL_OUT
    CHANNEL_PARAMS*    desired;
    CHANNEL_PARAMS*    acceptable;
    STREAM_MPO*        data_mpo;      // for incoming channels only
    STREAM_MPO*        closure_mpo;   // closure notification MPO
    U32                closure_data;  // delivered in closure notification
    CHANNEL_ENDPOINT*  endpoint;      // endpoint object (returned)
};

enum NET_CHANNEL_CREATE_STATUS {
    CREATE_ACCEPTED,   // request succeeded
    CREATE_FAILED,     // request failed
    CREATE_REJECTED    // request rejected by peer ST
};


NETWORK::create_channel() attempts to establish a set of network channels.
Either the entire set of channels is created or none are. Each channel in the request is
described by a NET_CHANNEL_REQUEST structure. The actual parameters of the
resulting channels are not returned, but are public data of the endpoint object.

6.3.  Network  Channel  Creation:  Passive  End 

ST  is  notified  of  network  channel  creation  by  request/reply  notification  operations 
directed  from  the  network  layer  to  the  ST_NOTIFY_MPO.  These  operations  inform  the 
passive  ST  (the  host  receiving  the  request)  of  a  network  channel  creation  request.  It  may 
either  accept  or  reject  the  request.  The  request  message  format  is: 

struct NET_NOTIFY_CREATE_REQUEST {
    HOST*                      source;       // active host
    U32                        opaque_data;  // from NETWORK::create_channel()
    U32                        desc_count;   // number of channels in request
    NET_CREATE_REQUEST_DESC[]  descs;        // array of descs
};

struct NET_CREATE_REQUEST_DESC {
    CHANNEL_DIRECTION  direction;      // CHANNEL_IN or CHANNEL_OUT
    CHANNEL_PARAMS     actual_params;  // parameters of this channel
    CHANNEL_ENDPOINT*  endpoint;       // local channel endpoint
};

The reply message format is:

struct NET_NOTIFY_CREATE_REPLY {
    BOOLEAN                  accepted;     // FALSE --> rejected
    U32                      opaque_data;  // returned to peer ST
    NET_CREATE_REPLY_DESC[]  descs;        // present only if accepted
};

struct NET_CREATE_REPLY_DESC {
    STREAM_MPO*  data_mpo;      // for incoming channels only
    U32          closure_data;  // included in closure notification
    STREAM_MPO*  closure_mpo;   // closure notification MPO
};


6.4.  Network  Channel  Deletion 

The NETWORK class provides the following virtual function to delete network channels:

void
NETWORK::delete(
    U32                 to_peer,         // delivered to peer
    int                 endpoint_count,  // number of channels to be deleted
    CHANNEL_ENDPOINT[]  endpoints        // list of local net channel endpoints
);


This  deletes  the  specified  network  channels  and  their  associated  endpoints  on  the  local 
host.  Network  channels  may  be  deleted  by  either  client. 


6.4.1.  Closure  Notification 


ST is informed of channel closure (both failures and deletions by the peer) by messages
delivered to the channel’s closure MPO. Failure notifications are guaranteed to be
delivered within the failure_delay_bound channel parameter. Network channel
closure notification messages have their type field set to NET_CLOSE_TYPE, and
have the following format:


struct NET_NOTIFY_CLOSE {
    NET_CLOSURE_REASON  close_reason;    // reason channel was closed
    CHANNEL_ENDPOINT*   local_endpoint;  // local network channel endpoint
    U32                 my_opaque;       // data supplied upon channel creation
    U32                 peer_opaque;     // data supplied by peer
};

enum NET_CLOSURE_REASON {
    DELETED_BY_PEER,
    PEER_CRASH,
    NETWORK_FAILURE
};

My_opaque is the closure_data supplied locally when the channel was created. If
the channel was deleted by the peer, peer_opaque is the data supplied by the
NETWORK::delete() operation of the peer. ST is responsible for eventually deleting
the network channel endpoint objects after receiving the notification message; this
prevents any “dangling pointer” problem.

6.5.  Network  Channel  Data  Messages 

The  OUT_NET_CHANNEL  base  class  provides  the  following  virtual  function  to  send  a 
message: 

OUT_NET_CHANNEL::send(
    MESSAGE*  message
);

The header of message contains a DEADLINE flag. If set, the header contains a
deadline for message transmission; this should not precede the current real time plus the
guaranteed delay of the channel. This allows the client to specify that a message has less
stringent delay parameters than those normally associated with the channel. The deadline
is used for queueing when the network interface has a nonempty transmission queue.
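One plausible way to use the deadline for queueing is earliest-deadline-first ordering of the transmission queue, sketched below. The text does not specify the queueing discipline, so this is an assumption, as are the names TransmitQueue and QueuedMessage, the integer time units, the treatment of messages without the DEADLINE flag (deadline = current time + guaranteed delay), and the clamping of deadlines that are earlier than that.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical queue entry: effective deadline, arrival order, message id.
struct QueuedMessage {
    std::uint64_t deadline;
    std::uint64_t order;
    int id;
};

// Comparator for std::priority_queue: the entry with the earliest deadline
// ends up on top; ties are broken in arrival (FIFO) order.
struct LaterDeadline {
    bool operator()(const QueuedMessage& a, const QueuedMessage& b) const {
        if (a.deadline != b.deadline) return a.deadline > b.deadline;
        return a.order > b.order;
    }
};

class TransmitQueue {
public:
    explicit TransmitQueue(std::uint64_t guaranteed_delay)
        : delay_(guaranteed_delay) {}

    // has_deadline models the DEADLINE flag in the message header.
    void enqueue(int id, std::uint64_t now, bool has_deadline,
                 std::uint64_t deadline = 0) {
        const std::uint64_t earliest = now + delay_;
        const std::uint64_t effective =
            has_deadline ? std::max(deadline, earliest) : earliest;
        q_.push(QueuedMessage{effective, next_order_++, id});
    }

    bool empty() const { return q_.empty(); }

    // Removes and returns the id of the message with the earliest deadline.
    int dequeue() {
        assert(!q_.empty());
        const int id = q_.top().id;
        q_.pop();
        return id;
    }

private:
    std::uint64_t delay_;
    std::uint64_t next_order_ = 0;
    std::priority_queue<QueuedMessage, std::vector<QueuedMessage>,
                        LaterDeadline> q_;
};
```

With this ordering, a message whose client has relaxed its deadline yields to messages that still need the channel's normal delay guarantee.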

Each  network  channel  has  an  associated  data  MPO  to  which  data  messages  are  delivered. 
Data  messages  have  their  type  field  set  to  NET_DATA_TYPE  to  distinguish  them 
from  closure  notification  messages. 

6.6.  Security  Considerations 

A single host may be connected to both secure and insecure networks. To ensure that
messages from an insecure network do not interfere with authenticated channels, each
network maintains a table of valid MPO’s and will deliver messages only to these.*

If a network is not physically secure, its channel establishment protocol will be insecure,
and channel creation notification messages may have false remote host names. Such a
“spurious” channel does not represent a security violation; it will be detected and
rejected by ST.

* We currently assume that each network uses a separate network interface. Having
networks share interfaces is currently under investigation.

7. THE SUBTRANSPORT LAYER

All network communication in DASH (except for network-internal tasks such as routing
and network management) passes through the subtransport (ST) layer. ST enhances the
channel service provided by the network layer in the following ways:

• ST multiplexes ST channels onto network channels and caches idle network
channels.

• ST uses encryption and/or checksumming mechanisms to improve security and
error rate parameters, as needed.

• ST does fragmentation and reassembly so that the maximum message sizes
available to its clients can exceed that offered by the network layer.

• ST provides “fast acknowledgements” that clients can use to provide flow control
and capacity enforcement.

The  clients  of  ST  include  stream  transport  protocols  and  the  remote  operation  facility 
(ROF). 

7.1.  ST  Client  Interface 

Clients  interact  with  ST  through  a  combination  of  procedure  calls  and  message-passing 
operations.  The  procedural  interface  allows  clients  to  initiate  operations  such  as  channel 
creation  (for  the  remainder  of  this  section  “channel”  refers  to  ST  channels  only,  not  net¬ 
work  channels).  The  message-passing  interface  is  used  to  inform  clients  of  events  such 
as  channel  creation  and  closure,  and  to  allow  them  to  send  and  receive  messages  on  exist¬ 
ing  channels. 

The  interface  for  creating  and  closing  channels  involves  the  following  objects  (see  Figure 
7.1): 

•  The  ST  layer  is  represented  by  the  SUBTRANSPORT  module.  Its  operations 
include  channel  creation  and  deletion. 


[Figure 7.1: Client Interface for Creating and Deleting Channels. Diagram: in the client
layer, the active client calls create_channel() and delete_channel() on the subtransport
(ST) layer; the passive client calls get_request() and send_reply() on a notification MPO,
on which the ST layer performs request_reply().]

• Notification of channel creation is done by performing request/reply operations on
client-specified notification MPO’s.

The interface for interactions with existing channels involves the following objects (see
Figure 7.2):

•  A  channel  endpoint  is  represented  on  its  local  host  by  an  endpoint  object.  An  ST 
endpoint  is  active  on  the  host  that  created  the  channel,  and  passive  on  the  other  end. 
It  is  outgoing  or  incoming  depending  on  the  data  direction  relative  to  an  endpoint. 
There  are  two  ST  endpoint  base  classes,  OUT_ST_CHANNEL  and 
IN_ST_CHANNEL.  OUT_ST_CHANNEL  is  derived  from  the  STREAM_MPO 
base  class.  Both  endpoint  classes  are  derived  from  a  base  class  that  includes  a 
type  field,  so  their  types  can  be  determined  from  generic  pointers. 

• Each channel has an associated stream-mode data MPO at its receiving end. Client
messages are sent to this MPO by default.

•  Each  channel  has  an  associated  stream-mode  closure  MPO  at  each  end.  In  the  event 
of  channel  closure,  a  message  is  sent  to  this  MPO. 

•  Each  channel  has  an  associated  stream-mode  acknowledgement  MPO  at  its  outgoing 
end.  Fast  acknowledgement  messages  are  sent  to  this  MPO. 

[Figure 7.2: The Client Interface for Channel Usage. Labels: RMS endpoint (incoming),
RMS endpoint (outgoing), send(), subtransport (ST) layer.]

The stream-mode messages sent by ST include a value in the type field of the message
header. These values include:

• ST_DATA_TYPE: the message is channel client data.

• ST_REDIRECTED_DATA_TYPE: if a channel client data message is directed to an
MPO other than the default data MPO, but the MPO remote reference is invalid, it
is assigned this type and delivered to the default data MPO.

• ST_CLOSURE_TYPE: the message is a notification of channel closure.

• ST_ACK_TYPE: the message is a fast acknowledgement.

The  presence  of  this  type  field  allows  clients  to  use  a  single  MPO  to  handle  multiple  mes¬ 
sage  types.  For  example,  a  single  MPO  may  serve  as  both  the  data  MPO  and  closure 
MPO  for  an  incoming  channel. 
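A client that uses one MPO for several purposes would dispatch on the type field, roughly as sketched below. The enumerator values, the Message struct, and the IncomingChannelClient class are assumptions made for this illustration; only the four type names come from the text.

```cpp
#include <cassert>
#include <string>
#include <vector>

// The four message types named in the text; numeric values are arbitrary here.
enum ST_MESSAGE_TYPE {
    ST_DATA_TYPE,
    ST_REDIRECTED_DATA_TYPE,
    ST_CLOSURE_TYPE,
    ST_ACK_TYPE
};

// Hypothetical, much-reduced message: a type field and a body.
struct Message {
    ST_MESSAGE_TYPE type;
    std::string body;
};

// Hypothetical client of an incoming channel whose data MPO and closure MPO
// are the same object, so every message arrives through handle().
class IncomingChannelClient {
public:
    void handle(const Message& m) {
        switch (m.type) {
        case ST_DATA_TYPE:
            data.push_back(m.body);
            break;
        case ST_REDIRECTED_DATA_TYPE:
            ++redirected;              // intended MPO reference was invalid
            data.push_back(m.body);
            break;
        case ST_CLOSURE_TYPE:
            closed = true;
            break;
        case ST_ACK_TYPE:
            break;                     // not expected on an incoming channel
        }
    }

    std::vector<std::string> data;
    int redirected = 0;
    bool closed = false;
};
```

Sharing one MPO keeps data and closure notifications in a single ordered stream, so the client sees the closure only after the data that preceded it in the queue.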

7.1.1.  Channel  Creation:  Active  End 

The  SUBTRANSPORT  module  provides  the  following  operation  for  creating  ST  chan¬ 
nels: 

ST_CHANNEL_CREATE_STATUS
SUBTRANSPORT::create_channel(
    HOST*             destination,
    REMOTE_REF*       notify_mpo,   // who to notify at peer
    U32               to_peer,      // passed to peer in notification
    U32*              from_peer,    // returned from peer
    int               desc_count,   // number of channels to create
    ST_CREATE_DESC[]  descs         // array of per-channel descriptors
);

struct ST_CREATE_DESC {
    CHANNEL_DIRECTION  direction;     // CHANNEL_IN or CHANNEL_OUT
    CHANNEL_PARAMS*    desired;
    CHANNEL_PARAMS*    acceptable;
    CHANNEL_ENDPOINT*  endpoint;      // channel endpoint object (returned)
    U32                to_peer;       // passed to passive client
    U32                from_peer;     // returned from peer
    STREAM_MPO*        data_mpo;      // incoming only
    STREAM_MPO*        closure_mpo;   // MPO for closure notification
    U32                closure_data;  // opaque data in closure message
    STREAM_MPO*        ack_mpo;       // outgoing only
};

enum ST_CHANNEL_CREATE_STATUS {
    CREATE_ACCEPTED,     // request succeeded
    CREATE_FAILED,       // request failed
    CREATE_REJECTED,     // request rejected by peer
    INVALID_NOTIFY_MPO   // invalid notify mpo was specified
};

SUBTRANSPORT::create_channel() is used to create a set of channels to the
specified host. Either the entire set is created or none are. The peer is notified at
notify_mpo. To_peer is passed to the passive client in the notification message; it
can be used to identify the request. If the peer rejects the request, from_peer can be
used to provide an error code. Each channel is described by an ST_CREATE_DESC
structure. If the operation succeeds, a pointer to the channel endpoint object is returned
in endpoint. The to_peer and from_peer fields in the ST_CREATE_DESC
allow clients to exchange a word of data associated with each channel. The actual
parameters of the channel are public data of the object. Data_mpo, for incoming
channels, specifies the data MPO (for outgoing channels, the data MPO is specified by
the passive ST client). Closure notification messages will be written to
closure_mpo. For outgoing channels, ack_mpo specifies a stream MPO to which
fast acknowledgements (see Section 7.1.6) are to be sent.


7.1.2.  ST  Creation:  Passive  End 


ST  notifies  the  passive  client  of  attempts  to  create  channels  by  doing  a  request/reply 
operation  on  a  notification  MPO.  The  passive  client  may  either  accept  or  reject  the  crea¬ 
tion  request.  If  it  accepts,  the  reply  message  specifies  data,  acknowledgement,  and  clo¬ 
sure  MPO’s  for  each  channel.  The  passive  client  is  also  notified  when  an  attempted  ST 
channel  creation  request  fails,  e.g.,  because  a  network  channel  of  sufficient  capacity 
could  not  be  created.  Notification  request  messages  have  the  following  structure: 


struct ST_NOTIFY_REQUEST {
    HOST*                 source;      // active host
    BOOLEAN               success;     // was the creation successful?
    int                   desc_count;  // number of channels
    ST_NOTIFY_REQ_DESC[]  descs;       // defined only if successful
};

struct ST_NOTIFY_REQ_DESC {
    BOOLEAN  direction;    // true iff outgoing
    void*    endpoint;     // local channel endpoint object
    U32      opaque_data;  // from create_channel() call
};


Notification  reply  messages  have  the  following  form: 


struct ST_NOTIFY_REPLY {
    BOOLEAN                accepted;
    ST_NOTIFY_REPL_DESC[]  descs;     // present only if accepted
};

struct ST_NOTIFY_REPL_DESC {
    U32          opaque;       // for closure message
    STREAM_MPO*  data_mpo;     // incoming only
    STREAM_MPO*  ack_mpo;      // outgoing only
    STREAM_MPO*  closure_mpo;  // where to deliver closure message
};


The data items returned by the passive client in its ST_NOTIFY_REPLY message are
analogous to those supplied by the active client in the
SUBTRANSPORT::create_channel() call above.


7.1.3.  ST  Channel  Creation  Scenarios 

The  first  ST  channels  established  to  a  given  host  are  those  created  by  the  ROF  module 
when  it  first  invokes  a  remote  operation  on  that  host.  The  establishment  of  subsequent 
channels  follows  this  scenario: 

(1)  The  active  client  contacts  the  passive  client  via  ROF.  The  passive  client  creates  a 
notification  MPO  and  obtains  a  remote  reference  for  it,  which  is  included  in  the 
ROF  reply  message. 

(2) The active client receives the ROF reply message.

(3) The active client calls SUBTRANSPORT::create_channel(), and the two
ST modules set up the requested channels.

(4) When the channels have been created, the passive ST does a request/reply operation
on the notification MPO. If the channel group is rejected, the channels are deleted.

(5) The call to SUBTRANSPORT::create_channel() returns at the active end
with the appropriate status information.

7.1.4.  ST  Channel  Deletion 

The  SUBTRANSPORT  module  provides  the  following  operation  for  deleting  ST  chan¬ 
nels: 

void
SUBTRANSPORT::delete_channel(
    U32*                opaque,          // array of opaque data
    int                 endpoint_count,  // number of channels to delete
    CHANNEL_ENDPOINT[]  endpoints        // endpoint objects
);

This deletes the specified channels and the corresponding local endpoint objects. The
peers may be on different remote hosts. For each channel, a closure message (of type
ST_CLOSURE_TYPE) containing the corresponding element of opaque is delivered
to the peer’s closure MPO if possible. The client is responsible for deleting the
endpoint after receiving the closure notification message. Having the client delete the
endpoint solves any “dangling pointer” problem. A closure message has the following
structure:

struct ST_CLOSURE {
    CHANNEL_ENDPOINT*  endpoint;     // local endpoint object
    ST_CLOSURE_REASON  reason;
    U32                my_opaque;    // supplied at channel creation
    U32                peer_opaque;  // defined only for PEER_CLOSE
};

enum ST_CLOSURE_REASON {
    DELETED_BY_PEER,
    PEER_CRASH,
    NETWORK_FAILURE
};


7.1.5. ST Channel Data Messages

A data message can be sent on an outgoing channel using the
STREAM_MPO::send() operation on its endpoint object. The message header
contains a bitmask with the following flags:

AUTHENTICATE_SENDER 

This is meaningful only on a host-authenticated ST channel. The message header
contains a pointer to an OWNER object. On delivery, the incoming message will
include a pointer to a corresponding OWNER object in the peer kernel. If the source
kernel is security correct [1] (i.e., if it provides privacy and authentication locally)
the receiver knows that the message originated from OWNER. If the source kernel
is not security correct, the receiver knows only that the source kernel possesses the
private key of OWNER.

AUTHENTICATE_RECEIVER

This is meaningful only on a host-authenticated ST channel. The outgoing message
header contains a pointer to an OWNER object. ST will verify that the owner is
present on the destination host before sending the message.* However, there is no
guarantee that the message is actually delivered to that owner.

DESTINATION_MPO 

The  message  header  contains  a  remote  reference  to  an  MPO  on  the  remote  host;  the 
message  is  to  be  delivered  to  this  MPO  (instead  of  the  default  data  MPO). 

ACK_REQUESTED 

The  message  header  contains  a  32-bit  message  ID;  ST  sends  an  acknowledgement 
message  containing  this  ID  to  the  acknowledgement  MPO  when  this  message  is 
delivered. 

DEADLINE 

The message header contains a deadline for transmission. This deadline should not
precede the current real time plus the guaranteed delay. This allows the client to
specify that a message has less stringent delay parameters than those normally
associated with the channel. This deadline determines the message’s queueing priority
at the network level.

If the DESTINATION_MPO flag is not set in a STREAM_MPO::send() operation, or
if an invalid remote reference is given, the message will be delivered to the channel’s
default data MPO (in the latter case the message is assigned type
ST_REDIRECTED_DATA_TYPE). This facility is used by ROF to handle retransmitted
replies sent to MPO’s that have been deleted.

The header of a message delivered by ST includes a copy of the flags bitmask
supplied by the sender. If the AUTHENTICATE_SENDER flag was set, the header will
include a pointer to an OWNER object representing the sender.
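
The flag rules of this section can be summarized in a short sketch. The flag names
come from the text; the bit positions, the Delivery type, and the three helper
functions are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Flag names follow the text; the bit positions are assumed.
enum StFlag : std::uint32_t {
    AUTHENTICATE_SENDER   = 1u << 0,
    AUTHENTICATE_RECEIVER = 1u << 1,
    DESTINATION_MPO       = 1u << 2,
    ACK_REQUESTED         = 1u << 3,
    DEADLINE              = 1u << 4,
};

enum class Delivery { ExplicitMpo, DefaultMpo, DefaultMpoRedirected };

// Picks the delivery target. A missing flag selects the default data MPO;
// an invalid reference also does, but the message is then marked redirected
// (ST_REDIRECTED_DATA_TYPE in the text).
inline Delivery delivery_target(std::uint32_t flags, bool ref_is_valid) {
    if (!(flags & DESTINATION_MPO)) return Delivery::DefaultMpo;
    return ref_is_valid ? Delivery::ExplicitMpo : Delivery::DefaultMpoRedirected;
}

// The two authentication flags are meaningful only on a
// host-authenticated ST channel.
inline bool auth_flags_meaningful(std::uint32_t flags, bool host_authenticated) {
    const std::uint32_t auth = AUTHENTICATE_SENDER | AUTHENTICATE_RECEIVER;
    return (flags & auth) == 0 || host_authenticated;
}

// A deadline should not precede the current time plus the guaranteed delay.
inline bool deadline_ok(std::uint64_t deadline, std::uint64_t now,
                        std::uint64_t guaranteed_delay) {
    return deadline >= now + guaranteed_delay;
}
```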

7.1.6.  Fast  Acknowledgements 

If  a  message  is  sent  with  the  ACK_REQUESTED  flag  set,  ST  will  (unreliably)  send  an 
acknowledgement  message  to  the  channel’s  acknowledgement  MPO  after  the  original 
message  has  been  delivered.  The  acknowledgement  message  has  its  type  set  to 
ST_ACK_TYPE,  and  has  the  following  format: 

struct ST_NOTIFY_ACK_REPLY {
    U32  message_id;
};

This facility can be used for acknowledgement-based flow control. ST-level
acknowledgements may be preferable to higher-level acknowledgements, since the
acknowledgement can be sent earlier. The facility does not replace higher-level
reliability acknowledgements.
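
As an illustration of acknowledgement-based flow control, the sketch below keeps a
bounded set of message IDs that were sent with ACK_REQUESTED and not yet
acknowledged. The AckWindow class and its limit parameter are hypothetical. Since ST
acknowledgements are unreliable, a real client would also need a timeout to expire
entries whose acknowledgement was lost; that part is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Sender-side window driven by ST fast acknowledgements (a sketch).
class AckWindow {
public:
    explicit AckWindow(std::size_t limit) : limit_(limit) {}

    // True if another ACK_REQUESTED message may be sent now.
    bool can_send() const { return outstanding_.size() < limit_; }

    // Records a message sent with ACK_REQUESTED and the given 32-bit ID.
    // Returns false if the window is full or the ID is already in flight.
    bool on_send(std::uint32_t message_id) {
        if (!can_send()) return false;
        return outstanding_.insert(message_id).second;
    }

    // Handles an ST_ACK_TYPE message; unknown or duplicate IDs are ignored.
    void on_ack(std::uint32_t message_id) { outstanding_.erase(message_id); }

    std::size_t in_flight() const { return outstanding_.size(); }

private:
    std::size_t limit_;
    std::unordered_set<std::uint32_t> outstanding_;
};
```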


* We have not yet defined precisely what it means for an owner to be present on a host.

7.2.  The Subtransport Protocol

This section describes the DASH Subtransport Protocol. This protocol is used for
communication between ST layers, and must be implemented on all DASH hosts. The
protocol consists of two related subprotocols: The ST Control Protocol is used for secure
channel establishment, owner certification, ST channel creation and deletion, fast
acknowledgements, and pinging. The ST Data Protocol is used for conveying client
messages.

7.2.1.  Security  Encapsulations 

Recall from Section 3.1 that the parameters of a channel include the following:

•  Private:  true  if  eavesdropping  is  impossible. 

•  Host-authenticated:  true  if  impersonation  by  another  host  is  impossible. 

•  Bit  error  rate:  the  long-term  average  bit  error  rate. 

In  addition,  a  given  network  has  the  parameters 

•  All  Hosts  Trusted:  this  flag  is  true  if  this  host  trusts  the  kernels  of  all  hosts  on  the 
network.  This  trust  simplifies  owner  certification,  but  is  orthogonal  to  network 
security;  it  does  not  imply  that  the  network  is  private  or  host-authenticated. 

•  Physical Broadcast Network (PBN): this flag is true if any message received
completely by any host on the network is received by its addressee.

The  messages  sent  by  ST  have  varying  security  and  error  rate  requirements.  Indeed,  the 
requirements  may  vary  between  the  different  fields  of  a  message.  ST  uses  the  following 
mechanisms: 

•  Cleartext:  the  message  is  sent  verbatim,  with  no  additional  data. 

•  Encryption: part or all of the message is encrypted using DES single-key
encryption [16]. A channel key shared by the two hosts is used. The 64-bit remainder of
the encryption is appended to the cyphertext. This provides privacy, authentication
and error detection.

•  Cryptographic Checksumming: part or all of the message is sent in cleartext, but
the 64-bit remainder from its DES encryption is appended. This provides a
“cryptographic checksum” having the property that it is virtually impossible to modify the
data without modifying the checksum, so this mechanism provides both error
detection and authentication.

•  Checksumming: part or all of the message is sent in cleartext, but is followed by a
32-bit checksum. This provides error detection. However, it does not provide
authentication (even if the checksum were encrypted) because it is possible to
modify the data in such a way that the checksum is unchanged.

•  Encrypted Trailer: this technique is used only when communicating on a PBN. A
64-bit trailer, encrypted with the channel key, is appended to the entire message
(not just to the part being authenticated). The trailer is a 32-bit sequence number
followed by a 32-bit timestamp in seconds. The receiver decrypts the trailer, and
accepts the message if it lies within an acceptable range (in terms of both sequence
number and time) of the previous packet received. This provides authentication.

These properties are summarized in Table 7.1.

technique                 privacy   authentication   error detection
cleartext                 no        no               no
encryption                yes       yes              yes
checksum                  no        no               yes
cryptographic checksum    no        yes              yes
encrypted trailer         no        yes              no

Table 7.1: Properties of Security Mechanisms.
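
The acceptance test for an encrypted trailer can be sketched as follows. The text
says only that the trailer must lie within an acceptable range of the previous
packet; the specific policy here (sequence numbers must advance by a bounded amount,
timestamps may differ by a bounded number of seconds) and the names Trailer,
TrailerWindow and accept_trailer are assumptions. Decryption with the channel key is
taken to have happened already.

```cpp
#include <cassert>
#include <cstdint>

// Trailer contents after decryption with the channel key.
struct Trailer {
    std::uint32_t seqno;
    std::uint32_t timestamp;  // seconds
};

// Assumed policy parameters.
struct TrailerWindow {
    std::uint32_t max_seq_gap;
    std::uint32_t max_time_gap;  // seconds
};

inline bool accept_trailer(const Trailer& prev, const Trailer& cur,
                           const TrailerWindow& w) {
    // Modulo-2^32 subtraction: a sequence number equal to or behind the
    // previous one gives zero or a very large gap, and is rejected.
    const std::uint32_t seq_gap = cur.seqno - prev.seqno;
    if (seq_gap == 0 || seq_gap > w.max_seq_gap) return false;

    const std::uint32_t dt = cur.timestamp >= prev.timestamp
                                 ? cur.timestamp - prev.timestamp
                                 : prev.timestamp - cur.timestamp;
    return dt <= w.max_time_gap;
}
```

Under this policy a replayed packet is rejected because its sequence number does not
advance, which is the property that lets the trailer serve for authentication on a
PBN.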


The combination of mechanisms used for a particular message (or part of a message) is
called its security encapsulation. The ST can use any combination of the above
mechanisms to achieve the needed properties. The choice depends on what is provided by the
network layer. The first four techniques are mutually exclusive. Encrypted trailers may
be used alone or in combination with checksumming.

ST  is  free  to  use  any  security  encapsulation  that  has  the  needed  properties.  The  choice 
may  depend  on  the  architecture  (processor  and  encryption/checksumming  hardware). 
Assume that 1) in order of increasing cost, the techniques are

cleartext
encrypted trailer
checksum
cryptographic checksum
encryption

2) that the cost differences are all nonnegligible, and 3) that techniques can be intermixed
within a message with no additional cost. With these assumptions, the most efficient
security encapsulation can be determined as follows:

if (ST channel is private and network channel is not private) {
    use encryption;
} else {
    if (ST channel is authenticated and network channel is not authenticated) {
        if (network is PBN) {
            if (ST channel error rate < network channel error rate) {
                use encrypted trailers and checksumming;
            } else {
                use encrypted trailers;
            }
        } else {
            use cryptographic checksumming;
        }
    } else {
        if (ST channel error rate < network channel error rate) {
            use checksumming;
        } else {
            use cleartext;
        }
    }
}
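
The same decision procedure can be written as a compilable function. This is a
sketch: the ChannelProps structure and the function name are hypothetical, while the
result type mirrors the data_mode and encrypted_trailer fields of the
SECURITY_ENCAPSULATION message type.

```cpp
#include <cassert>

enum class DataMode { Cleartext, Checksum, CryptoChecksum, Encrypted };

struct Encapsulation {
    DataMode mode;
    bool     encrypted_trailer;
};

// Hypothetical summary of the channel parameters that matter here.
struct ChannelProps {
    bool   is_private;
    bool   authenticated;
    double bit_error_rate;
};

inline Encapsulation choose_encapsulation(const ChannelProps& st,
                                          const ChannelProps& net,
                                          bool network_is_pbn) {
    // Privacy can only be added by encryption, which also covers
    // authentication and error detection.
    if (st.is_private && !net.is_private) return {DataMode::Encrypted, false};

    // A checksum is needed only if ST promises a lower error rate
    // than the network channel delivers.
    const bool need_checksum = st.bit_error_rate < net.bit_error_rate;

    if (st.authenticated && !net.authenticated) {
        if (network_is_pbn) {
            return {need_checksum ? DataMode::Checksum : DataMode::Cleartext, true};
        }
        return {DataMode::CryptoChecksum, false};
    }
    return {need_checksum ? DataMode::Checksum : DataMode::Cleartext, false};
}
```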

7.2.2.  Mixed  Encapsulations 

The security and reliability requirements for the header of ST data messages (see Section
7.2.6) may differ from those of the data part. Therefore different security encapsulations
may be used for the two parts. For example, a message might have the form

ST header (cleartext)
cryptographic checksum of ST header
user data (cleartext)
checksum of user data

Checksums (regular or cryptographic) of the two parts are not merged. If both parts use
an encrypted trailer, only one is appended to the message.

7.2.3.  Encapsulation  Type  is  Implicit 

There is no need to encode the security encapsulation of each message within the
message itself. The ST can deduce the encapsulation as follows.

•  Control and data messages can be distinguished because they arrive on different
network channels, and hence on different MPO’s.

•  The encapsulation of control messages is determined when the control connection is
established.

•  The header encapsulation of data messages is determined when the network channel
is established. From the ST channel remote reference contained in the header (see
Section 7.2.6), the ST can determine the ST channel for which the message is
destined. The data encapsulation for an ST channel is determined when the channel is
established.

7.2.4.  Security Considerations

Recall from Section 6.6 that network channel data messages are inherently
network-authenticated. There is therefore no problem in mixing secure (i.e., private and
authenticated) and insecure networks in the same host. For example, suppose a host is connected
to network A, which is secure, and network B, which is insecure. If a secure ST channel
is established using a network channel on network A, then no encryption will be used. A
malicious host on network B can indeed send forged messages, but they will not interfere
with the secure ST channel.

7.2.5.  The  ST  Control  Protocol 

The ST control protocol from host A to host B consists of a sequence of synchronous
request/reply operations from A to B.† Operations from B to A may overlap these
operations. A simple retransmission protocol is used for reliability. The control protocol also
includes fast acknowledgements, which are unreliable unidirectional messages.

An  ST  control  connection  between  two  hosts  consists  of  a  pair  of  network  channels 
between  the  hosts,  one  in  each  direction,  each  created  by  its  sending  end.  All  ST  control 
messages  are  sent  on  these  channels.  The  ST  control  protocol  involves  relatively  small 
and  infrequent  messages.  Hence  the  network  channels  can  have  a  small  capacity,  but 
should  have  minimal  delay. 

A host A may initiate the creation of a control connection to a host B simply by creating
the initial network channel to B. B may accept the connection by creating a network
channel to A, or may reject it by rejecting the original channel creation request. No
synchronization problem arises if two hosts simultaneously create a control connection to
one another.

A  subtransport  secure  connection  exists  between  two  ST  modules  A  and  B  if: 

•  A and B have a means for sending private and host-authenticated messages. In
some cases this can be done by using private or host-authenticated network
channels. More generally, the ST modules must have agreed upon a common secret
encryption key.

•  A and B have a means for certifying to each other what owners (kernel clients) they
represent. They have this means if 1) they have agreed upon a pair of owner
certification strings that can be used to prove the possession of owner private keys,
or 2) they trust each other.

A secure connection is an extension of a control connection; it is established only after a
control connection has been established, and ceases to exist if the control connection
fails. Secure connection establishment is done by message exchange using
Diffie-Hellman public-key encryption [2]. Once the secure connection has been established,
DES single key encryption is used. The first operation on a control connection is secure
connection establishment. This operation is always initiated by the host with the
lexicographically greater name. To establish a secure connection to host B, host A generates a
random channel secret key and sends the following request message:


msg_typedef struct {
    SKE_KEY                 key;            // the connection secret key
    BYTES                   dest_name;      // name of destination host
    CERT_STRING             cert_string;
    SECURITY_ENCAPSULATION  encapsulation;
    CHECKSUM                checksum;
} ESTABLISH_SECURE_CHANNEL_REQUEST;

msg_typedef struct {
    enum { CLEARTEXT,
           CHECKSUM,
           CRYPTO_CHECKSUM,
           ENCRYPTED
    } data_mode;
    flags {encrypted_trailer};
} SECURITY_ENCAPSULATION;

Key and dest_name are encrypted with A’s private key. Cert_string is a random
string which will be used to certify owners from B to A (see Section 7.2.5.2).
Encapsulation indicates the type of security encapsulation to be used for future
control messages and data message headers from A to B.‡ The entire request is then
encrypted with B’s public key.

The reply message has the form:

msg_typedef struct {
    SKE_KEY                 key;            // the channel key
    BYTES                   dest_name;      // the name of the destination
    CERT_STRING             cert_string;
    SECURITY_ENCAPSULATION  encapsulation;
    CHECKSUM                checksum;
} ESTABLISH_SECURE_CHANNEL_REPLY;

Key and dest_name are encrypted with B’s private key. Cert_string is a random
string which will be used to certify owners from A to B (see Section 7.2.5.2).
Encapsulation indicates the type of security encapsulation to be used for future
control messages from B to A. The entire reply message is encrypted with A’s public key.

In both the request and reply messages, the presence of the destination name encrypted
with the sender’s private key provides authentication, while the encryption of the entire
message with the receiver’s public key provides privacy. This use of slow public key
encryption allows the ST to “bootstrap” into the faster single-key encryption using the
connection key.

† If experiments prove this serialization to be a performance bottleneck, the protocol will
be changed to allow parallel requests.

‡ There could also be a negotiation between A and B to determine the encapsulations of
control messages and data message headers, perhaps independently. This could also be done
on a per-network-channel basis.

7.2.5.1.  Control Message Structure

For control messages other than secure connection establishment, the security
encapsulation depends on the parameters of the underlying network channel. The bit error rate for
control messages must satisfy an upper bound determined by the two hosts. In addition,
control messages are host-authenticated. They may also be private; this is a parameter of
the host.

The following header is common to all ST control protocol messages:

msg_typedef enum {
    CERTIFY_ASK,
    CERTIFY_OFFER,
    CERTIFY_ASK_AGAIN,
    CREATE_CHANNEL,
    DELETE_CHANNEL,
    FAST_ACK,
    PING
} ST_CONTROL_OP;

msg_typedef struct {
    flags {request};            // true iff request; else reply
    ST_CONTROL_OP  operation;
    U32            my_seqno;
    U32            your_seqno;
} ST_CONTROL_HDR;

My_seqno is the sequence number of the next operation to be generated by the sender
(or the current operation, in the case of a request message). Your_seqno is the
sequence number of the next operation expected from the peer (or the operation being
replied to). This header is followed by a message body whose structure depends on the
message type.

The ST control protocol uses a simple retransmission policy. A request message is
periodically retransmitted until a reply is received. Duplicate request messages are
ignored. There are no reply acknowledgements. A reply message is retransmitted only if
a message is received with a sequence number indicating that the reply was lost.

7.2.5.2.  Owner Certification

An owner O is said to be certified from host A to host B if B believes that A possesses O’s
private key. An owner O becomes certified from A to B if either

(1)  B trusts A, and A informs B that it has O’s private key, or

(2)  A has proved to B that it possesses O’s private key. To do this, A encrypts the pair
<R^> with O’s private key, where R is the certification string provided by B on
secure connection establishment.

7.2.5.3.  Certification Caching

When an owner O has been certified from A to B, A and B will each have an
OWNER object for O, and each will have issued a remote reference to the other for their
version of this object. A’s remote reference table entry for this object has flags indicating
whether O has been certified 1) from A to B, and 2) from B to A. These entries serve as
an owner certification cache. Subsequent operations requiring that O be certified from A
to B can simply consult this cache.

Each remote reference entry also contains an invalid time beyond which the certification
from the peer (if any) is no longer valid (this value is obtained from the name service
entry for the owner). If a reference is made beyond this time, then the owner must be
recertified. If the user public key has not changed or the owners are mutually trustful,
then recertification is not done.
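
The owner certification cache can be pictured as a lookup on the remote reference
table. The sketch below is an assumption-laden illustration: RefEntry, CertCache and
the use of 32-bit integers as remote references are hypothetical, and the rule that
recertification is skipped when the public key is unchanged is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// One remote-reference table entry, reduced to the cache-related fields.
struct RefEntry {
    bool          certified_to_peer;    // O certified from this host to the peer
    bool          certified_from_peer;  // O certified from the peer to this host
    std::uint64_t invalid_time;         // peer's certification is void after this
};

class CertCache {
public:
    void record(std::uint32_t ref, const RefEntry& e) { table_[ref] = e; }

    // True if the peer's certification of the owner can be relied on at
    // time `now` without exchanging further certification messages.
    bool peer_certified(std::uint32_t ref, std::uint64_t now) const {
        auto it = table_.find(ref);
        return it != table_.end() && it->second.certified_from_peer &&
               now <= it->second.invalid_time;
    }

private:
    std::unordered_map<std::uint32_t, RefEntry> table_;
};
```

A miss, or a hit past the invalid time, is the point at which a host would fall back
to an explicit certification request to its peer.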
7.2.5.4.  Certification Messages

The CERTIFY_OFFER operation is used to offer an (unsolicited) certificate to another
host.

msg_typedef struct {
    ST_CONTROL_HDR  header;
    REMOTE_REF      owner_ref;
    BYTES           owner_name;
    CERTIFICATE     certificate;
} CERTIFY_OFFER_REQ;

msg_typedef struct {
    ST_CONTROL_HDR  header;
    flags           {accepted};
    REMOTE_REF      ref;
} CERTIFY_OFFER_REPLY;

If the sender already has a remote reference to the owner, it is passed in owner_ref
and owner_name is empty; otherwise owner_ref is NULL_REF and the owner
name is passed explicitly, in which case a remote reference is returned in the reply.
Certificate is the certification string, encrypted with the owner’s private key, or
empty if the peer trusts this host.

The CERTIFY_ASK operation is used to request that the peer authenticate an owner.

msg_typedef struct {
    ST_CONTROL_HDR  header;
    REMOTE_REF      owner_ref;    // owner remote reference (if known)
    BYTES           owner_name;   // symbolic owner name
} CERTIFY_ASK_REQ;

msg_typedef struct {
    ST_CONTROL_HDR     header;
    enum {OK, FAILED}  status;
    REMOTE_REF         owner_ref;
    CERTIFICATE        certificate;
} CERTIFY_ASK_REPLY;

The CERTIFY_ASK_AGAIN operation is used to request re-certification of an owner.

msg_typedef struct {
    ST_CONTROL_HDR  header;
    REMOTE_REF      owner_ref;
} CERTIFY_ASK_AGAIN_REQUEST;

msg_typedef struct {
    ST_CONTROL_HDR     header;
    enum {OK, FAILED}  status;
    CERTIFICATE        certificate;
} CERTIFY_ASK_AGAIN_REPLY;

7.2.5.5.  ST Channel Creation

Each ST channel is multiplexed onto an existing network channel. An ST module can
create an ST channel only on a network channel that it owns.

A channel creation request supplies the following parameters for each channel being
requested:

msg_typedef struct {
    CHANNEL_DIRECTION  direction;        // CHANNEL_OUT or CHANNEL_IN
    CHANNEL_PARAMS     params;
    REMOTE_REF         net_channel_ref;
    REMOTE_REF         st_channel_ref;
    U32                opaque_data;      // passed from active to passive
} ST_CONTROL_CHANNEL_DESC;


Net_channel_ref  indicates  which  data  network  channel  is  to  be  used  for  the  ST 
channel;  it  must  be  a  channel  owned  by  the  sender.  St_channel_ref  is  a  reference 
to  the  local  channel  endpoint  object. 


The create request message has the following structure:

msg_typedef struct {
    ST_CONTROL_HDR      header;
    U32                 opaque_data;
    ENCAPSULATION_TYPE  encapsulation_type;
    REMOTE_REF          notification_mpo;
    U32                 number;
} ST_CONTROL_CREATION_REQUEST;

This is followed by a sequence of ST_CONTROL_CHANNEL_DESC fields.
Opaque_data will be included in the notification message delivered on the passive
side to the notification_mpo.

The  reply  message  has  the  following  structure: 

msg_typedef struct {
    ST_CONTROL_HDR         header;
    enum {ACCEPT, REJECT}  status;
    U32                    opaque_data;   // from passive to active
} ST_CONTROL_CREATION_REPLY;


If  the  request  was  accepted,  this  is  followed  by  a  sequence  of  REMOTE_REF  and  U32 
fields,  referring  to  the  ST  endpoint  objects  at  the  passive  end  and  the  opaque  data  passed 
from  the  passive  to  active  client  for  each  channel. 


7.2.5.6.  ST Channel Deletion

The DELETE_CHANNEL operation uses the following messages:

msg_typedef struct {
    ST_CONTROL_HDR  header;
    U32             number;
} DELETE_CHANNEL_REQUEST;

msg_typedef struct {
    ST_CONTROL_HDR  header;
} DELETE_CHANNEL_REPLY;

The  request  message  is  followed  by  a  list  of  remote  references  to  the  ST  channels  to  be 
deleted. 


7.2.5.7.  Fast Acknowledgements

A fast acknowledgement message has the following structure:

msg_typedef struct {
    ST_CONTROL_HDR  header;
    REMOTE_REF      st_channel_ref;
    U32             opaque_data;
} FAST_ACK;

Such a message acknowledges the receipt of a client message with the given opaque data
on the ST channel identified by st_channel_ref.


7.2.6.  The  ST  Data  Protocol 


The ST data protocol uses a set of network channels disjoint from the control connection
channels. There are two types of ST data messages:

•  A  simple  message:  used  to  send  one  ST  client  message. 

•  A fragment message: used to send a fragment of an ST client message. This is
needed when the client message (with headers) exceeds the network channel
maximum message size.
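
How a client message might be divided into fragment messages can be sketched as
below. The report does not give the splitting rule, so everything here is an
assumption: equal-sized fragments except possibly the last, fragment numbers starting
at 0, and the names FragmentPlan and plan_fragments. The total_frags and frag_num
fields correspond to fields of the fragment header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct FragmentPlan {
    std::uint32_t total_frags;  // number of fragments in the message
    std::uint32_t frag_num;     // number of this fragment (from 0, assumed)
    std::size_t   offset;       // position of this fragment's data
    std::size_t   data_size;    // bytes of user data in this fragment
};

// Splits `message_size` bytes of user data into pieces that each fit,
// together with an ST header of `header_size` bytes, in one network
// message of at most `max_message` bytes.
inline std::vector<FragmentPlan> plan_fragments(std::size_t message_size,
                                                std::size_t header_size,
                                                std::size_t max_message) {
    std::vector<FragmentPlan> out;
    if (max_message <= header_size) return out;  // no room for any data

    const std::size_t payload = max_message - header_size;
    const std::size_t count =
        message_size == 0 ? 1 : (message_size + payload - 1) / payload;

    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t off = i * payload;
        const std::size_t left = message_size - off;
        const std::size_t len = left < payload ? left : payload;
        out.push_back({static_cast<std::uint32_t>(count),
                       static_cast<std::uint32_t>(i), off, len});
    }
    return out;
}
```

A plan with a single entry corresponds to the case where a simple message suffices
and no fragment header is needed.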


An ST data message consists of an ST header followed by user data. The security
encapsulation is determined as follows: the ST header must have a bit error rate below a
system-defined value, and must be authenticated. The user data must satisfy the ST
channel parameters. The ST header of a simple data message has the following structure:

msg_typedef struct {
    ST_DATA_TYPE  type;               // SIMPLE in this case
    flags {fast_ack, auth_sender}  message_flags;
    U32           st_seq_num;         // a unique ID for this message
    U32           ack_id;             // for fast acks
    REMOTE_REF    dest_st_channel;
    REMOTE_REF    dest_mpo;
    REMOTE_REF    sender;
    U32           data_size;
} ST_DATA_SIMPLE;

Dest_st_channel refers to the receiving ST channel endpoint object. Dest_mpo
refers to the MPO to which the message is to be delivered. If it is NULL_REF, the
message is delivered to the default MPO. The optional sender refers to the OWNER
object responsible for sending the message.


The header of a fragment message has the following structure:

msg_typedef struct {
    ST_DATA_TYPE  type;               // FRAGMENT in this case
    U32           total_frags;        // number of fragments in message
    U32           frag_num;           // number of this fragment
    U32           st_seq_num;         // a unique ID for this message
    U32           ack_id;             // for fast acks
    REMOTE_REF    dest_st_channel;
    REMOTE_REF    dest_mpo;
    REMOTE_REF    sender;
    U32           data_size;
} ST_DATA_FRAG;

8.  THE REMOTE OPERATION FACILITY

The DASH Remote Operation Facility (ROF) provides host-to-host request/reply
communication. It supports higher-level request/reply communication (such as service
access by user processes), as well as direct kernel communication. The following are the
most important features of ROF:

•  The channel approach is used: low-delay channels are used for critical-path
messages such as requests and replies, and high-delay channels are used for other
messages such as retransmissions and acknowledgements.

•  There is no restriction on the number of outstanding remote operations between two
DASH hosts. This removes a possible limit on the parallelism within a
multiprocessor host.

•  ROF does not dictate any mechanism for accessing messages (e.g.,
serialization/deserialization). Clients and servers access messages directly using
DML, the DASH Message Language (see Appendix I).

8.1.  Reliability 

ROF  provides  three  semantics  for  remote  operations: 

Exactly  Once 

In the absence of machine or network failures, operations are executed exactly once,
and in any case are not executed more than once. Operations may have reply
messages.

At  Least  Once 

In  the  absence  of  machine  or  network  failures,  operations  are  executed  at  least  once, 
and  possibly  more  than  once.  Operations  may  have  reply  messages.  This  operation 
type  can  be  used  for  idempotent  operations. 

Maybe 

Operations are executed 0 or 1 times; the client is not told which. There can be no
reply message. This operation type can be used to distribute hints or other non-critical
information.
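
The difference between the first two semantics shows up on the server when a request
is retransmitted. The sketch below is one common way to obtain exactly-once
behaviour, by saving replies under a transaction identifier; it is an assumption for
illustration, not a description of the ROF protocol. The Server class, the txn_id
parameter and the use of strings as messages are all hypothetical, and the Maybe
semantics is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

enum class RoType { ExactlyOnce, AtLeastOnce };

class Server {
public:
    using Operation = std::function<std::string(const std::string&)>;

    explicit Server(Operation op) : op_(std::move(op)) {}

    // Handles one arrival of a request, whether original or retransmitted.
    std::string handle(RoType type, std::uint64_t txn_id,
                       const std::string& request) {
        if (type == RoType::AtLeastOnce) {
            return op_(request);  // a retransmission simply runs it again
        }
        auto it = replies_.find(txn_id);
        if (it != replies_.end()) {
            return it->second;    // duplicate: replay the saved reply
        }
        return replies_.emplace(txn_id, op_(request)).first->second;
    }

private:
    Operation op_;
    std::unordered_map<std::uint64_t, std::string> replies_;
};
```

Re-execution is harmless only for idempotent operations, which is why the text
recommends the At Least Once type for those.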

8.2.  Remote  Operation  Opcodes 

A  remote  operation  (RO)  is  identified  by  a  32-bit  RO  opcode.  The  set  of  RO  opcodes  is 
divided  as  follows: 

•  Mandatory  operations  that  must  be  supported  on  all  DASH  hosts. 

•  Optional  operations  whose  semantics  are  globally  defined,  but  that  need  not  be 
implemented  on  all  hosts. 

•  Non-standard  operations,  which  are  specific  to  a  particular  kernel  type. 

8.3.  Invoking  a  Remote  Operation 

ROF is represented by an ROF module, which provides the following interface for
invoking an RO:

RO_STATUS
ROF::ro_call(
    HOST*      destination,
    U32        ro_opcode,   // RO opcode - discussed above
    MESSAGE*   request,     // request message
    MESSAGE**  reply,       // reply message (optional)
    RO_TYPE    ro_type      // EXACTLY_ONCE, AT_LEAST_ONCE, MAYBE
);

enum RO_STATUS {
    OK,
    INVALID_OPCODE,
    OPCODE_NOT_SUPPORTED,
    NO_CONNECTION,          // unable to establish a ROF connection
    ERROR_OTHER,
};

If the reply is NULL and the status is OK, there was no reply. ROF clients may use the
AUTHENTICATE_SENDER and AUTHENTICATE_RECEIVER flags in request and
reply messages (see Section 7.1.5).

8.4.  ROF Server Interface

A REQ_REPLY_MPO object is associated with each RO opcode using the following
operation:

void
ROF::register_ro_mpo(
    U32             ro_opcode,
    REQ_REPLY_MPO*  ro_mpo
);

When a request is received, the ROF module looks up the RO opcode. If it is invalid or
is an operation the server does not support, ROF sends a reply message with the
appropriate error code. Otherwise, a request_reply() operation is performed on
the associated MPO.

8.5.  ROF Connection Parameters

Communication between peer ROF modules uses a dedicated set of ST channels (see
Section 8.6). ROF does not support fragmentation of request or reply messages. The
maximum request and reply message size is determined by the maximum message size of
the ST channels that ROF is using. The following functions return these sizes:

int
ROF::max_request_message(
    HOST*  remote_host   // destination host
);

int
ROF::max_reply_message(
    HOST*  remote_host   // destination host
);

ROF clients and servers may have differing security (authentication and privacy) needs.
For simplicity, ROF uses private and host-authenticated channels.
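
Returning to the server interface of Section 8.4, the opcode lookup can be sketched
as a small registry. This is a hypothetical illustration: RofServer, define_opcode
and the use of a callable in place of a REQ_REPLY_MPO object are assumptions, and the
status type is reduced to the three values that the lookup can produce.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <utility>

enum class RoStatus { Ok, InvalidOpcode, OpcodeNotSupported };

class RofServer {
public:
    using Handler = std::function<void()>;  // stands in for REQ_REPLY_MPO

    // Marks an opcode as one whose meaning is defined (mandatory,
    // optional, or non-standard), whether or not this host implements it.
    void define_opcode(std::uint32_t op) { known_.insert(op); }

    // Counterpart of ROF::register_ro_mpo() in this sketch.
    void register_ro_mpo(std::uint32_t op, Handler h) {
        handlers_[op] = std::move(h);
    }

    // Looks up the opcode of an incoming request and either runs the
    // associated handler or reports the appropriate error.
    RoStatus dispatch(std::uint32_t op) const {
        if (known_.count(op) == 0) return RoStatus::InvalidOpcode;
        auto it = handlers_.find(op);
        if (it == handlers_.end()) return RoStatus::OpcodeNotSupported;
        it->second();
        return RoStatus::Ok;
    }

private:
    std::unordered_set<std::uint32_t> known_;
    std::unordered_map<std::uint32_t, Handler> handlers_;
};
```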
8.6.  The ROF Protocol

8.6.1.  ROF  Connections 

All  ROF  communication  between  a  particular  client/server  host  pair  uses  a  dedicated  set 
of  ST  channels  called  the  ROF  connection.  The  connection  is  created  by  the  client  ROF 
module.  A  ROF  connection  is  directional;  it  is  outgoing  relative  to  the  client,  incoming 
relative  to  the  server.  If  two  ROF  modules  are  each  acting  as  a  server  for  the  other,  then 
there  must  be  two  ROF  connections  between  them. 

Separate fast and slow channels are used in each direction in order to reflect the relative
deadlines of messages. A ROF connection consists of four channels: the
FAST_REQUEST_CHANNEL and SLOW_REQUEST_CHANNEL go from the client to
the server, and the FAST_REPLY_CHANNEL and SLOW_REPLY_CHANNEL go from
the server to the client. Initial request and reply messages always use the fast channels,
while retransmissions and acknowledgements use the slow channels.

The  channels  in  a  ROF  connection  are  “logical”  in  that 

•  They  may  not  be  distinct:  the  fast  channel  and  the  slow  channel  in  a  given  direction 
may  actually  be  the  same  channel. 

•  There  may  be  more  than  one  actual  channel  for  each  logical  channel;  this  might  be 
necessary  if  the  capacity  of  one  ST  channel  is  insufficient. 

•  The  channels  may  change  over  the  life  of  a  ROF  connection;  if  one  of  them  is 
closed  (e.g.,  because  of  network  failure),  the  ROF  module  may  create  a  new  channel 
without  breaking  the  ROF  connection. 

The  client  ROF  module  creates  the  ROF  connection  to  a  peer  host  when  it  has  a  request 
for  that  host  and  a  ROF  connection  does  not  already  exist.  This  occurs,  for  example, 
after  one  of  the  hosts  comes  up  from  a  crash.  ST  notification  messages  are  sent  to  the 
ROF  notification  port,  which  has  a  well-known  remote  reference  (see  Section  5). 

ST  allows  the  creator  of  an  ST  channel  to  pass  opaque  data  in  the  notification  message. 
ROF  uses  this  to  allow  the  server  to  distinguish  between  the  different  channels  of  the 
ROF  connection;  there  is  a  different  code  for  each  of  the  four  channels.  In  addition,  a  bit 
is  used  to  specify  whether  or  not  this  is  a  new  connection.  This  is  necessary  to  support 
reestablishment  of  individual  channels  of  the  ROF  connection. 
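As an illustration of the opaque notification data, the sketch below packs a channel code and the new-connection bit into a single byte. The text does not specify the encoding, so the bit layout, the code values, and all identifiers here are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout (not given in the text): bits 0-1 hold the channel
// code, bit 2 is set when the channel belongs to a new ROF connection.
enum ChannelCode : std::uint8_t {
    FAST_REQUEST = 0,
    SLOW_REQUEST = 1,
    FAST_REPLY   = 2,
    SLOW_REPLY   = 3
};

const std::uint8_t CHANNEL_MASK       = 0x03;
const std::uint8_t NEW_CONNECTION_BIT = 0x04;

// Pack the channel code and the new-connection bit into one opaque byte.
std::uint8_t encode_opaque(ChannelCode code, bool new_connection) {
    std::uint8_t value = static_cast<std::uint8_t>(code);
    if (new_connection) {
        value = static_cast<std::uint8_t>(value | NEW_CONNECTION_BIT);
    }
    return value;
}

// Recover the channel code on the server side.
ChannelCode decode_channel(std::uint8_t opaque) {
    return static_cast<ChannelCode>(opaque & CHANNEL_MASK);
}

// True if the creator marked this channel as part of a new connection.
bool decode_new_connection(std::uint8_t opaque) {
    return (opaque & NEW_CONNECTION_BIT) != 0;
}
```

A server receiving a notification with the bit clear would treat the channel as a replacement within an existing ROF connection.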

The  ROF  client  owns  the  channels  in  the  ROF  connection,  and  it  may  delete  the  connec¬ 
tion.  It  does  so  by  using  SUBTRANSPORT :: delete_channel  ( )  to  delete  the 
channels.  No  messages  need  to  be  passed  between  the  ROF  modules.  The  server  may 
request  to  delete  the  ROF  connection.  This  may  be  necessary  if  the  server  is  going  down. 

8.6.2.  ROF  Messages 

A  remote  operation  (RO)  may  involve  a  request  message,  a  reply  message,  and  various 
retransmissions  and  acknowledgement  messages.  An  ROF  transaction  is  the  set  of  all 
messages  associated  with  a  single  RO. 

ROF messages are of two types: ROF client messages sent by the ROF client, and ROF
server messages sent by the ROF server. A ROF message consists of two parts: a header
containing control information, optionally followed by data. The two parts are sent
together as a single message on an ST channel.




There are two header formats, one for client messages and one for server messages. To
simplify message handling, each header type is fixed-size, regardless of the actual fields
used.

8.6.2.1.  Client  Message  Header  Format 

Client message headers have the following format:

msg_typedef flags {
    EXACTLY_ONCE_REQUEST,
    AT_LEAST_ONCE_REQUEST,
    MAYBE_REQUEST,
    EXP_ACK_REQUESTED,
    PING_REQUEST,
    REPLY_ACK,
    DELETE_ACK
} CLIENT_FLAGS;

msg_typedef struct {
    CLIENT_FLAGS  client_flags;  // message type
    U32           ro_opcode;     // which operation
    U32           seqno;         // ROF transaction ID
    REMOTE_REF    client_mpo;    // where to send reply
} ROF_CLIENT_HEADER;

Sequence numbers are generated by the client to identify each ROF transaction. The first
transaction on a ROF connection may have any sequence number. Subsequent transactions
must have strictly increasing sequence numbers. Sequence numbers may not
repeat. Client_mpo is the remote reference to the MPO where the reply message is
to be delivered to the client.

8.6.2.2. Server Message Header Format

Server message headers have the following format:

msg_typedef flags {
    REPLY,
    REQUEST_ACK,
    DELETE_REQ
} SERVER_FLAGS;

msg_typedef enum {
    SUCCESSFUL,
    INVALID_RO_OP,
    RO_OP_NOT_SUPPORTED,
} RO_STATUS;

msg_typedef struct {
    SERVER_FLAGS  server_flags;
    RO_STATUS     status;
    U32           seqno;
} ROF_SERVER_HEADER;
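The sequence-number rule stated for client headers in Section 8.6.2.1 (any first value, strictly increasing afterwards, no repeats) can be sketched as a small per-connection checker. The class and method names are hypothetical, and the sketch does not address wraparound of the 32-bit field, which the text leaves open.

```cpp
#include <cassert>
#include <cstdint>

// Tracks the sequence numbers seen on one ROF connection.
// Hypothetical helper, not part of the ROF interface.
class SeqnoTracker {
public:
    // Returns true if seqno may identify a new transaction: the first
    // transaction may use any value, later ones must strictly increase.
    bool accept(std::uint32_t seqno) {
        if (has_last_ && seqno <= last_) {
            return false;  // repeated or decreasing sequence number
        }
        has_last_ = true;
        last_ = seqno;
        return true;
    }

private:
    bool has_last_ = false;   // false until the first transaction is seen
    std::uint32_t last_ = 0;  // highest sequence number accepted so far
};
```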


8.6.3.  ROF  Protocol  Specifications 

The ROF exactly-once protocol uses implied request acknowledgements; that is, a reply
acknowledges the corresponding request. Other systems such as Sprite [23] and Cedar
[5] also use implied reply acknowledgements (i.e., a request message acknowledges the
previous reply). Implied reply acknowledgements are not used in ROF.

Tables 7.1 and 7.2 specify the ROF exactly-once protocol. Tables 7.3 and 7.4 specify
the ROF at-least-once protocol. It is assumed, for simplicity, that the ROF connection
has already been established, and that there are no machine or network failures.

9.  THE  SERVICE  ACCESS  MECHANISM 

Services  are  a  class  of  remotely-accessible  logical  resources  in  the  DASH  distributed  sys¬ 
tem  architecture.  The  DASH  service  access  mechanism  (SAM)  allows  clients  to  name 
services  and  communicate  with  them  in  a  uniform  way.  The  goals  of  SAM  are: 

•  To  provide  replication  transparency.  A  service  may  consist  of  multiple  instances 
running  on  different  hosts.  A  client  need  not  know  which  instance  handles  a  partic¬ 
ular  request  or  session. 

•  To  provide  location  transparency  in  the  sense  that  service  names  do  not  specify  or 
limit  the  location  of  the  servers. 

•  To  provide  failure  transparency:  If  a  service  instance  fails,  SAM  may,  without 
client  involvement,  locate  and  begin  using  a  second  instance  of  the  service. 

•  To  provide  a  flexible  framework  for  client/service  communication  protocols.  Ser¬ 
vices  may  provide  interfaces  that  have  real-time  communication  performance 
requirements,  or  that  use  special-purpose  stream  protocols. 


                             Event
Current State                Receive Reply     Receive ACK     Timeout
---------------------------  ----------------  --------------  -----------------------
STATE 1                      Send Reply ACK.   NA.             Send Duplicate Request.
Initial Request Sent.        Goto State 4.                     Ask For Explicit ACK.
No explicit ACK requested.                                     Goto State 2.

STATE 2                      Send Reply ACK.   Goto State 3.   Send Duplicate Request.
Duplicate Request Sent.      Goto State 4.                     Explicit ACK Requested.
Explicit ACK Requested.                                        Goto State 2.

STATE 3                      Send Reply ACK.   Goto State 3.   Send Ping Message.
Request Sent.                Goto State 4.                     Goto State 3.
ACK of Request Received.

STATE 4                      Send Reply ACK.   NA.             NA.
Received Reply.              Goto State 4.

Table 7.1: Client State Table, ROF Exactly-Once Protocol.
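The client state table for the exactly-once protocol can be read as a transition function. The sketch below is one possible encoding of Table 7.1; the table identifies states only by number, so the C++ state, event, and action names are assumptions made for readability.

```cpp
#include <cassert>

// States 1 to 4 of Table 7.1, with illustrative names.
enum class ClientState {
    REQUEST_SENT   = 1,  // initial request sent, no explicit ACK requested
    DUPLICATE_SENT = 2,  // duplicate request sent, explicit ACK requested
    REQUEST_ACKED  = 3,  // ACK of request received
    REPLY_RECEIVED = 4   // reply received
};

enum class ClientEvent { RECEIVE_REPLY, RECEIVE_ACK, TIMEOUT };

enum class ClientAction {
    NONE,
    SEND_REPLY_ACK,
    SEND_DUPLICATE_REQUEST,  // duplicate request asking for an explicit ACK
    SEND_PING
};

struct ClientTransition {
    bool applicable;      // false for the "NA." cells of the table
    ClientAction action;
    ClientState next;
};

ClientTransition client_step(ClientState state, ClientEvent event) {
    switch (event) {
    case ClientEvent::RECEIVE_REPLY:
        // Every state answers a reply with a reply ACK and goes to state 4.
        return {true, ClientAction::SEND_REPLY_ACK, ClientState::REPLY_RECEIVED};
    case ClientEvent::RECEIVE_ACK:
        if (state == ClientState::DUPLICATE_SENT ||
            state == ClientState::REQUEST_ACKED) {
            return {true, ClientAction::NONE, ClientState::REQUEST_ACKED};
        }
        break;
    case ClientEvent::TIMEOUT:
        if (state == ClientState::REQUEST_SENT ||
            state == ClientState::DUPLICATE_SENT) {
            return {true, ClientAction::SEND_DUPLICATE_REQUEST,
                    ClientState::DUPLICATE_SENT};
        }
        if (state == ClientState::REQUEST_ACKED) {
            return {true, ClientAction::SEND_PING, ClientState::REQUEST_ACKED};
        }
        break;
    }
    return {false, ClientAction::NONE, state};  // "NA." entry: state unchanged
}
```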
                  Event
Current State     Receive            Receive Request,   Receive            Receive            Finish
                  Request            ACK Requested      Reply ACK          PING Request       Executing RO
----------------  -----------------  -----------------  -----------------  -----------------  -------------
STATE 1           Execute RO.        Send Request ACK.  NA.                NA.                NA.
Idle.             Goto State 2.      Execute RO.
                                     Goto State 2.

STATE 2           Send Request ACK.  Send Request ACK.  NA.                Send Request ACK.  Send Reply.
Executing RO.     Goto State 2.      Goto State 2.                         Goto State 2.      Goto State 3.

STATE 3           Send Reply.        Send Reply.        Goto State 4.      Send Reply.        NA.
Reply Sent.       Goto State 3.      Goto State 3.                         Goto State 3.
No ACK Received.

STATE 4           NA.                NA.                NA.                NA.                NA.
RO transaction
Complete.

Table 7.2: Server State Table, ROF Exactly-Once Protocol.
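The server side of the exactly-once protocol lends itself to a table-driven encoding. The following sketch transcribes Table 7.2 into a constant array indexed by state and event; the identifiers and the use of an action bitmask are assumptions, not part of the ROF specification.

```cpp
#include <cassert>

// States 1 to 4 and the five events of Table 7.2 (illustrative names).
enum ServerState { IDLE, EXECUTING, REPLY_SENT, COMPLETE, NUM_STATES };
enum ServerEvent {
    RECV_REQUEST,
    RECV_REQUEST_ACK_REQUESTED,
    RECV_REPLY_ACK,
    RECV_PING,
    FINISH_EXECUTING,
    NUM_EVENTS
};

// Actions are combined as a bitmask because one cell may list two.
constexpr unsigned NO_ACTION        = 0u;
constexpr unsigned SEND_REQUEST_ACK = 1u;
constexpr unsigned EXECUTE_RO       = 2u;
constexpr unsigned SEND_REPLY       = 4u;

struct ServerTransition {
    bool applicable;   // false for the "NA." cells
    unsigned actions;
    ServerState next;
};

constexpr ServerTransition NA_CELL = {false, NO_ACTION, IDLE};

constexpr ServerTransition SERVER_TABLE[NUM_STATES][NUM_EVENTS] = {
    // STATE 1: idle
    { {true, EXECUTE_RO, EXECUTING},
      {true, SEND_REQUEST_ACK | EXECUTE_RO, EXECUTING},
      NA_CELL, NA_CELL, NA_CELL },
    // STATE 2: executing RO
    { {true, SEND_REQUEST_ACK, EXECUTING},
      {true, SEND_REQUEST_ACK, EXECUTING},
      NA_CELL,
      {true, SEND_REQUEST_ACK, EXECUTING},
      {true, SEND_REPLY, REPLY_SENT} },
    // STATE 3: reply sent, no ACK received
    { {true, SEND_REPLY, REPLY_SENT},
      {true, SEND_REPLY, REPLY_SENT},
      {true, NO_ACTION, COMPLETE},
      {true, SEND_REPLY, REPLY_SENT},
      NA_CELL },
    // STATE 4: RO transaction complete
    { NA_CELL, NA_CELL, NA_CELL, NA_CELL, NA_CELL }
};

// Look up one transition; an inapplicable event leaves the state unchanged.
ServerTransition server_step(ServerState state, ServerEvent event) {
    ServerTransition t = SERVER_TABLE[state][event];
    if (!t.applicable) {
        t.next = state;
    }
    return t;
}
```

Keeping the table as data makes it easy to compare the code against the printed table cell by cell.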


Table  7.3:  Client  State  Table,  ROF  At-Least-Once  Protocol. 
                  Event
Current State     Receive            Receive Request,   Receive            Finish
                  Request            ACK Requested      PING Request       Executing RO
----------------  -----------------  -----------------  -----------------  -------------
STATE 1           Execute RO.        Send Request ACK.  NA.                NA.
Idle.             Goto State 2.      Execute RO.
                                     Goto State 2.

STATE 2           Send Request ACK.  Send Request ACK.  Send Request ACK.  Send Reply.
Executing RO.     Goto State 2.      Goto State 2.      Goto State 2.      Goto State 1.

Table 7.4: Server State Table, ROF At-Least-Once Protocol.


9.1.  The  Service  Abstraction 

A  DASH  service  is  a  set  of  instances  that  together  form  a  logical  resource.  Each  instance 
resides on a single host. An instance may consist of a process, a set of processes, or a
“registration” with the host kernel that causes a process to be created as needed. Information
about the services on a host is kept in stable storage (e.g., a configuration file) so
that it survives crashes.

A  replicated  service  may  provide  an  abstraction  of  consistent  data,  in  which  case  it  needs 
to ensure consistency between its instances. DASH neither supplies nor dictates any
method for this, or for ensuring the atomicity or permanence of operations on services.
Such  mechanisms  must  be  supplied  by  the  services  themselves,  perhaps  in  cooperation 
with  a  higher-level  transaction  manager. 

A  DASH  host  may  provide  a  kernel  service,  allowing  access  to  resources  that  are 
inherently  local  to  that  host,  such  as  its  physical  devices. 

Services  can  be  accessed  in  two  basic  modes: 

Request/Reply mode

Operations  are  request/reply.  Operations  are  conveyed  to  remote  service  instances 
via  ROF. 

Session  mode 

SAM uses ROF to contact an instance of the service and set up a communication
channel between the client and server. This allows clients and servers to communicate
using specialized, dynamically configurable protocols. (The design for session
mode access is not complete.)

9.1.1. Service Tokens

A service may issue a service token representing a name or object within the service.
The token can thereafter be supplied in lieu of a name in subsequent operations on the
service. A service token may be used only in accessing the service instance that issued
the token.

A  service  token  has  an  associated  set  of  operations,  specified  (by  a  bitmask)  when  the 
token  is  requested;  the  token  provides  the  right  to  perform  this  set  of  operations  on  the 
object  to  which  it  refers,  bypassing  any  underlying  protection  mechanism.  A  token  may 
have  no  access  rights,  in  which  case  it  serves  only  as  a  name  abbreviation. 

The  use  of  service  tokens  can  improve  performance  in  two  ways:  1)  it  eliminates  the 
need  for  the  service  to  do  per-operation  authorization  checking;  2)  it  eliminates  the  need 
for  the  service  to  do  per-operation  name  translation.  The  DASH  service  token  mechan¬ 
ism  is  related  to  the  V  system’s  UIO  interface  [9]. 

Service  tokens  can  serve  as  capabilities  (albeit  transient  ones),  and  therefore  must  be  pro¬ 
tected.  There  are  two  possible  approaches: 

•  The  token  includes  a  large  random  part,  is  secret,  and  must  be  encrypted  in  network 
messages. 

•  The  token  need  not  be  encrypted  in  network  messages,  and  may  be  a  small  integer. 
The service accepts a token only from its original recipient.

DASH  uses  the  second  approach.  A  service  maintains,  in  per-host  tables,  the  set  of  ser¬ 
vice  tokens  it  has  issued. 
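A minimal sketch of the second approach follows: tokens are small integers, and a token is honored only when presented by the host it was issued to. The TokenTable class, its methods, and the use of a 32-bit operation bitmask are illustrative assumptions rather than the DASH implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical token table kept by a service, keyed by (host, token).
class TokenTable {
public:
    // Issue a token for `name` to `host`, granting the operations set in
    // `op_mask` (a service-specific bitmask; 0 means no access rights).
    std::uint32_t issue(const std::string& host, const std::string& name,
                        std::uint32_t op_mask) {
        const std::uint32_t token = next_token_++;
        tokens_[std::make_pair(host, token)] = Entry{name, op_mask};
        return token;
    }

    // True if `token` was issued to `host` and grants every bit in `ops`.
    bool permits(const std::string& host, std::uint32_t token,
                 std::uint32_t ops) const {
        auto it = tokens_.find(std::make_pair(host, token));
        return it != tokens_.end() && (it->second.op_mask & ops) == ops;
    }

    // The service may discard a token at any time.
    void discard(const std::string& host, std::uint32_t token) {
        tokens_.erase(std::make_pair(host, token));
    }

private:
    struct Entry {
        std::string name;       // name or object the token stands for
        std::uint32_t op_mask;  // operations the token authorizes
    };
    std::map<std::pair<std::string, std::uint32_t>, Entry> tokens_;
    std::uint32_t next_token_ = 1;
};
```

Because the lookup key includes the host, a token copied from the network is of no use to any other host, which is why it need not be encrypted. This relies on the channel to the host being authenticated.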

Service  tokens  may  be  discarded  at  any  time  by  a  service.  This  may  be  done  either  to 
limit table size, or to force periodic reauthorization in support of an “eventual revocation”
policy. The client (or the client kernel) holding the token must store information
(name  and  operation  set)  used  to  obtain  the  token,  and  must  be  prepared  to  issue  another 
token  request  if  the  original  token  is  invalidated  by  the  service.  In  addition,  the  client 
can  “release”  the  token,  providing  a  hint  to  the  service  that  it  discard  the  token. 

A  token  does  not  have  associated  with  it  any  session  context  (e.g.,  position  within  a  file). 
Two  tokens  representing  the  same  name  and  having  the  same  rights  are  interchangeable. 

Tokens  are  usable  only  during  a  crash-free  period  on  both  the  client  and  the  service 
instance.  When  a  host  loses  a  secure  channel  to  a  remote  host,  all  tokens  associated  with 
services  running  on  the  remote  host  are  discarded. 


9.2.  Client  Interface  to  SAM 


SAM is represented by a SAM module. SAM provides the following function to perform
an operation:

SAM_STATUS
SAM::operation(
    NAMED_ENTITY*  prefix,     // NAME_SERVICE, SERVICE, SERVICE_TOKEN
    char*          extension,  // relative to prefix
    OWNER*         owner,
    MESSAGE*       request,
    MESSAGE**      reply
);

enum SAM_STATUS {
    OK,                 // operation was successful
    SERVICE_FAILURE,    // service was unavailable
    INVALID_PREFIX,     // prefix was invalid
    INVALID_EXTENSION,  // syntactically invalid extension
    NO_SUCH_EXTENSION,  // could not resolve name
    NO_AUTHORIZATION,   // authorization failure
};

Names are specified using prefix and extension. Prefix may point to a
NAMED_ENTITY object or be NULL. Extension is relative to prefix. If
prefix  is  NULL,  then  extension  is  a  complete  name.  Because  services  may  do 
authorization  based  on  the  owner  name,  the  owner  requesting  the  operation  is  specified  in 
owner. 

Service  tokens  are  represented  in  the  client  kernel  by  objects  of  class 
SERVICE_TOKEN  (derived  from  NAMED_ENTITY).  SAM  provides  the  following 
operation  for  creating  these  objects: 

SAM_STATUS
SAM::get_service_token(
    NAMED_ENTITY*    prefix,      // NAME_SERVICE, SERVICE, SERVICE_TOKEN
    char*            extension,   // relative to prefix
    OWNER*           owner,       // used for authorization
    char*            operations,  // operations associated with token
    SERVICE_TOKEN**  token        // new token
);

The meaning and format of operations is service-specific. The new service token is
returned in token.

9.3. Server Interface to SAM

Each instance of a service is identified by a pair consisting of

• The name of the host on which it runs.

• An instance ID, a 32-bit ID unique among service instances on that host.

Services have symbolic names in the DASH global name space (see Section 10). Each
service name is mapped (by the DASH name service) to a list of (host name, instance ID)
pairs. A service may have several different names. Only those instances of a service that
are intended for remote access need be listed in the name service entry. For example, a
file service may have local instances on workstations. These local instances might do
local caching, and never be accessed remotely.

The steps in offering a service that will run as a user process on the DASH kernel are as
follows:

(1) Write a program that implements the local control protocol (see Section 9.4).

(2) Make versions of this program for the hosts on which instances are to be run.

(3) Register the service with the SAM module on each of these hosts (see below).

(4) Register the service with the DASH name service.

Services can be registered and unregistered using:

SAM::register(
    REQ_REPLY_MPO*  service_mpo,
    SVC_ID          instance_id
);

SAM::unregister(
    SVC_ID  instance_id
);

Service_mpo  is  a  request/reply  MPO  to  which  requests  to  the  service  will  be 
delivered.  Depending  on  the  nature  of  the  service,  this  object  may: 

•  Perform  a  function  call  within  the  kernel. 

•  Deliver  the  message  to  a  user-level  server  process,  perhaps  creating  a  VAS  and  a 
process  if  necessary. 

The  SAM  module  maintains  a  table  of  service  registrations,  mapping  service  ID’s  to 
request/reply  MPO’s. 
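The registration table can be pictured as a map from instance IDs to request/reply MPOs. In the sketch below, REQ_REPLY_MPO is reduced to an empty stand-in type, the RegistrationTable class is hypothetical, and the methods are named register_service and unregister_service because register is a reserved word in standard C++.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

typedef std::uint32_t SVC_ID;

// Stand-in for the kernel's request/reply MPO class (details omitted).
struct REQ_REPLY_MPO {};

// Hypothetical table of service registrations kept by a SAM module.
class RegistrationTable {
public:
    // Returns false if instance_id is already registered.
    bool register_service(REQ_REPLY_MPO* service_mpo, SVC_ID instance_id) {
        return table_.insert(std::make_pair(instance_id, service_mpo)).second;
    }

    // Returns false if instance_id was not registered.
    bool unregister_service(SVC_ID instance_id) {
        return table_.erase(instance_id) == 1;
    }

    // Returns the MPO to which requests for instance_id are delivered,
    // or nullptr if no such instance is registered.
    REQ_REPLY_MPO* lookup(SVC_ID instance_id) const {
        auto it = table_.find(instance_id);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::map<SVC_ID, REQ_REPLY_MPO*> table_;
};
```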

9.4.  SAM  Protocols 

There  are  three  protocols  involved  in  service  access: 

(1)  A  protocol  between  peer  SAM  modules,  layered  on  top  of  ROF.  This  is  called  the 
SAM  network  control  protocol. 

(2)  A  protocol  between  a  server-side  SAM  module  and  a  local  service  instance.  This  is 
called  the  SAM  local  control  protocol. 

(3)  The  service-specific  end-to-end  protocol  between  a  client  and  a  service,  defining  the 
format  and  semantics  of  the  service’s  operations.  This  is  called  the  SAM 
client/service protocol.

9.4.1.  Network  Control  Protocol 

The  SAM  network  control  protocol  consists  of  remote  operations  using  the  ROF  facility. 
In  the  DASH  kernel,  these  operations  are  generated  and  handled  by  SAM  modules.  In 
specialized  server  machines,  they  might  be  handled  by  the  service  itself.  The  SAM  net¬ 
work  protocol  uses  the  following  set  of  ROF  opcodes: 

SVC_TOKEN_REQUEST 

SVC_OPERATION 

The  SVC_OPERATION  operation  is  used  to  perform  an  operation  on  a  service.  The 
request  message  has  the  following  structure: 

msg_typedef struct {
    U32    service_id;
    U32    service_token;
    BYTES  extension;
    BYTES  request;
} SVC_OPERATION_REQUEST;

Service_token  is  a  service  token  previously  issued  by  the  service,  or  NULL. 
Extension  is  a  name  extension  beyond  that  of  the  token,  or  beyond  the  name  of  the 
service  if  no  token  is  used. 
The reply message is either an operation result from the service, an error return, or a
REDIRECT message forwarding this request to another service instance. In the last case,
the message contains a (host name, instance ID) pair, which is a hint for which instance to
try next.
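A client-side SAM module might follow REDIRECT hints with a loop like the one sketched here. The Reply and InstanceRef types, the SendFn transport hook, and the hop limit are all assumptions; the text does not say how many redirections a client should follow.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Identifies one service instance: (host name, instance ID).
struct InstanceRef {
    std::string host;
    std::uint32_t instance_id;
};

enum class ReplyKind { RESULT, ERROR_RETURN, REDIRECT };

struct Reply {
    ReplyKind kind;
    InstanceRef hint;  // meaningful only when kind == REDIRECT
};

// Hypothetical transport hook: performs SVC_OPERATION on one instance.
typedef std::function<Reply(const InstanceRef&)> SendFn;

// Follow REDIRECT hints until a result or error arrives, giving up after
// max_hops redirections (the limit is an assumption, not in the text).
Reply send_following_redirects(InstanceRef target, const SendFn& send,
                               int max_hops) {
    Reply reply = send(target);
    for (int hops = 0;
         reply.kind == ReplyKind::REDIRECT && hops < max_hops; ++hops) {
        target = reply.hint;   // try the instance named in the hint
        reply = send(target);
    }
    return reply;  // still REDIRECT if the hop limit was reached
}
```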

The  SVC_TOKEN_REQUEST  operation  is  used  to  obtain  a  service  token.  It  uses  the 
following  messages: 

msg_typedef struct {
    U32    service_id;
    U32    old_token;
    BYTES  extension;
} SVC_TOKEN_REQUEST;

msg_typedef struct {
    TOKEN_REPLY_STATUS  status;
    U32                 new_token;
} SVC_TOKEN_REPLY;

In  the  request  message,  old_token  is  an  optional  service  token  to  which  exten¬ 
sion  is  relative. 

9.4.2.  Local  Control  Protocol 

The protocol between a server-side SAM module and a service instance uses the same
request/reply messages as the SAM network control protocol, except:

•  Messages  are  in  host  byte  order  instead  of  network  byte  order. 

•  Some  error  codes  (e.g.,  NO_SUCH_ID)  are  not  used. 

10.  GLOBAL  NAMING 

A  primary  goal  of  the  DASH  communication  architecture  is  that  resources  (data  and 
computational)  should  be  uniformly  and  securely  accessible  from  any  host.  This  requires 
a  facility  for  naming  and  locating  the  resources,  and  for  naming  and  authenticating 
resource  owners  and  clients.  These  functions  are  provided  by  the  DASH  global  naming 
system. 

10.1.  Name  Space  Structure 

The  DASH  global  naming  system  uses  a  single  tree-structured  name  space,  similar  to  that 
described  in  [6].  Conceptually,  a  name  is  a  list  of  components,  each  of  which  is  a  string 
of printable ASCII characters not containing the character “/”. In practice, a name is
represented  as  a  single  ASCII  character  string  consisting  of  a  sequence  of  components 
separated  and  preceded  by  “/”.  For  example, 

/usa/uc-berkeley/computer-science/filer

is  a  name  with  four  components.  A  “/”  by  itself  constitutes  a  name  with  zero  com¬ 
ponents;  this  is  the  name  of  the  root  of  the  naming  tree. 
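The name syntax above can be checked with a short parser. This sketch splits a name into components and treats “/” alone as the root with zero components. Rejecting empty components (as in “/usa//filer” or a trailing “/”) is an assumption, since the text does not say whether they are allowed, and parse_name is a hypothetical helper.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a global name into its components. Returns false if the name is
// syntactically invalid: it must begin with "/", contain only printable
// ASCII, and (by assumption) have no empty components.
bool parse_name(const std::string& name,
                std::vector<std::string>& components) {
    components.clear();
    if (name.empty() || name[0] != '/') {
        return false;
    }
    for (char c : name) {
        if (c < 0x20 || c > 0x7E) {
            return false;  // not a printable ASCII character
        }
    }
    if (name == "/") {
        return true;  // root of the naming tree: zero components
    }
    std::size_t start = 1;
    while (true) {
        const std::size_t slash = name.find('/', start);
        const std::string part =
            name.substr(start, slash == std::string::npos
                                   ? std::string::npos
                                   : slash - start);
        if (part.empty()) {
            components.clear();
            return false;
        }
        components.push_back(part);
        if (slash == std::string::npos) {
            break;
        }
        start = slash + 1;
    }
    return true;
}
```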

10.2.  Entity  Types 

The  system  is  used  to  name  four  types  of  entities:  hosts,  owners,  services  and  name  ser¬ 
vices.  The  internal  nodes  of  the  tree  represent  name  services,  and  the  leaves  of  the  tree 
represent  the  other  entity  types.  The  entity  types,  and  their  associated  attributes,  are  as 
follows:

•  An  owner  represents  an  individual  human  user  or  a  “role”  such  as  that  of  system 
manager.  Its  attributes  include  its  public  key. 

•  A  host  is  an  endpoint  of  physical  network  communication.  Its  attributes  include  a 
list  of  its  network  addresses,  and  the  name  of  its  owner. 

• A service is a logical resource provided by a set of programs or processes. Its attributes
include 1) a list of (host name, instance ID) pairs each specifying an instance
of the service (see Section 9.1) and 2) the name of the owner of the service. Services
that are not name services (see below) are called general services.

• A name service is a special type of service that manages a single directory. Each
entry in a directory has a name component, a type, and a set of attributes. The attributes
of a name service include those of general services (i.e., the service owner and
the set of hosts on which instances exist), but also include the attributes of the hosts,
the attributes of the host owners, and the attributes of the owner of the name service.
This extra information is included to avoid infinite recursion in the name resolution
process (see Section 10.5.3).

In  addition  to  the  mandatory  attributes  listed  above,  each  name  service  entry  may  have 
an  arbitrary-length  character  string  for  auxiliary  information.  This  typically  would  be 
structured as a set of name=value string pairs. It could be used to store attributes such as
the real-life name, US mail address, electronic mail server address, and phone number of an
owner, or the access protocol used by a service.

Each  name  service  entry  also  has  a  “cache  time”  field  indicating  the  maximum  amount 
of  time  for  which  it  should  be  cached  by  clients  (kernels  or  other  name  services).  There 
is  no  cache  consistency  protocol,  so  resolutions  may  be  incorrect  during  the  cached 
period.  Any  intermediate  agent  (e.g.,  another  name  server  or  a  kernel)  that  caches  name 
entries  should  maintain  the  amount  of  time  for  which  it  has  held  each  entry  and  invali¬ 
date  the  entry  when  the  cache  time  expires.  If  a  name  server  releases  the  entry  to  another 
agent,  it  should  replace  the  cache  time  field  with  a  suitably  reduced  value. 
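The cache-time rule can be expressed in a few lines. In this sketch an intermediate agent records when it received an entry and, on releasing it to another agent, passes on only the time that remains. The struct, the function names, and the choice of seconds as the unit are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cached copy of a name service entry. Times are in
// seconds; the text does not specify a unit.
struct CachedEntry {
    std::uint32_t cache_time;   // maximum caching time received with the entry
    std::uint32_t received_at;  // local clock reading when the entry arrived
};

// Time for which this agent has held the entry.
std::uint32_t held_for(const CachedEntry& e, std::uint32_t now) {
    return now - e.received_at;
}

// The entry must be invalidated once its cache time has expired.
bool is_expired(const CachedEntry& e, std::uint32_t now) {
    return held_for(e, now) >= e.cache_time;
}

// Cache time to place in the entry when releasing it to another agent:
// the original value reduced by the time already spent in this cache.
std::uint32_t remaining_cache_time(const CachedEntry& e, std::uint32_t now) {
    return is_expired(e, now) ? 0 : e.cache_time - held_for(e, now);
}
```

Reducing the field at every hop keeps the total time an entry spends in a chain of caches within the limit set by the originating name service.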

10.3.  Intra-Service  Naming 

SAM  allows  services  other  than  name  services  to  extend  the  global  name  space  below 
their  own  name.  Hence  they  can  provide  global  names  for  the  objects  they  manage. 
Such  a  name  has  the  form 

service-name/intra-service-name 

where  service-name  is  the  name  of  a  general  service.  For  example,  a  file  service  might 
provide  hierarchical  naming  of  its  files,  so  that  the  name 
/usa/uc-berkeley/cs/filer/anderson/foo
refers to the file

/anderson/foo

within the file service

/usa/uc-berkeley/cs/filer.

This feature removes the need to distinguish the two levels of naming, and makes it possible
for services to provide named “sub-services”. In addition, services can provide
non-hierarchical intra-service naming. For example, a file service supporting attribute-based
naming might provide a name of the form

/usa/uc-berkeley/cs/filer2/anderson.dash.kernel.c-source.scheduler

specifying a file (or group of files) with attributes anderson, dash, kernel, c-source, and
scheduler.


10.4.  The  Interface  to  Naming  in  the  DASH  Kernel 

In  the  DASH  kernel,  the  interface  to  the  global  naming  system  is  provided  by  the  NAM¬ 
ING  module.  The  basic  function  of  this  module  is  name  resolution.  That  is,  given  a  glo¬ 
bal  name  N,  it  finds  the  longest  prefix  of  N  that  names  a  standard  entity  (owner,  host, 
general  service,  or  name  service),  and  returns  1)  the  attributes  of  that  entity,  and  2)  the 
remainder  of  N  beyond  the  name  of  the  standard  entity.  The  interface  for  resolving  a 
name  is: 


NAMING_STATUS
NAMING::resolve(
    NAME_SERVICE*   prefix,            // a NAME_SERVICE or NULL
    char*           extension,         // relative to prefix
    OWNER*          owner,             // used for authorization
    NAMED_ENTITY**  result_prefix,     // standard entity (returned)
    char**          result_extension   // name remainder, could be NULL
);

enum NAMING_STATUS {
    OK,                    // resolution was successful
    NO_SUCH_NAME,          // no such name is known
    INVALID_NAME,          // syntactically invalid name
    NO_AUTHORIZATION,      // name service authorization failure
    NAME_SERVICE_FAILURE,  // a name service was unavailable
};


Names are specified using a prefix and an extension. Prefix, if non-NULL,
points to an object (returned by a previous call to NAMING::resolve()) representing
a name service. Extension extends the name represented by prefix. If prefix
is NULL, extension is a global name. Owner specifies the owner on
whose behalf the name is being resolved. This is relevant if authorization is used by any
of the name services involved. If the resolution is successful, the results are placed in
result_prefix and result_extension. Resolution fails if 1) the name is syntactically
invalid, 2) a directory lookup fails, 3) an authorization failure occurs, or 4) a
name service fails.
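The longest-prefix rule used by name resolution can be illustrated against a purely local set of known entity names. This is a simplification: the real NAMING module consults name services over the network, and the Resolution struct and resolve_longest_prefix() function are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>

// Result of resolving a name against a local set of standard entities.
struct Resolution {
    bool found;
    std::string entity;     // longest prefix that names a standard entity
    std::string remainder;  // rest of the name beyond that prefix
};

// Try successively shorter prefixes of `name`, cutting only at "/"
// boundaries, and return the first (longest) one found in `entities`.
Resolution resolve_longest_prefix(const std::set<std::string>& entities,
                                  const std::string& name) {
    std::size_t end = name.size();
    while (end > 0) {
        const std::string prefix = name.substr(0, end);
        if (entities.count(prefix) != 0) {
            return {true, prefix, name.substr(end)};
        }
        end = name.rfind('/', end - 1);  // drop the last component
        if (end == std::string::npos) {
            break;
        }
    }
    return {false, "", name};
}
```

For the file service example of Section 10.3, the entity would be the file service and the remainder would be the intra-service name handed to that service.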


10.5. The Name Service Protocol

This section specifies the minimal set of operations that must be supported by a DASH
name  service.  These  operations  may  be  invoked  by  other  name  services  acting  on  behalf 
of  their  clients,  or  by  the  clients  themselves.  A  name  service  may  provide  other  opera¬ 
tions  as  well,  such  as  those  needed  for  administration  or  authorization  changes. 
10.5.1. Name Resolution

Every  name  service  may  be  asked  to  resolve  any  name.  In  addition  to  various  error 
returns,  the  following  results  are  possible: 

(1)  If  this  resolution  is  successful,  the  attributes  of  the  standard  entity  with  the  longest 
prefix  are  returned. 

(2)  If  the  queried  name  service  is  unable  to  authenticate  the  requesting  owner  to  a  pro¬ 
tected  intermediate  name  service  (i.e.,  because  it  does  not  have  the  owner’s  private 
key),  the  entry  for  the  intermediate  name  service  is  returned.  The  client’s  kernel 
must  then  contact  this  name  service  directly. 

The name resolution messages are:

msg_typedef struct {
    SERVICE_TOKEN      prefix;
    BYTES              extension;
    RESOLVE_REQ_FLAGS  flags;      // see below
} NAMING_RESOLVE_REQUEST;

msg_typedef struct {
    U32       status;         // OK or AUTHORIZATION_ERROR
    U32       component_num;  // where resolution finished
    NS_ENTRY  entry;          // attributes of entry
} NAMING_RESOLVE_REPLY;

msg_typedef flags {
    no_cache  // do not use cached information
} RESOLVE_REQ_FLAGS;

NS_ENTRY is not specified. See Section 10.2 for a list of the attributes for each type of
named entity.

10.5.1.1. Name Tokens

Global names are potentially very long. Even with caching in name services,
component-by-component resolution may yield unacceptable performance. To confront
this problem, name services may supply service tokens (Section 9.1.1) representing
names. This particular type of service token is called a name token. Name tokens have
no associated access rights. The same token may be issued to any number of clients, but
is valid only for the name service instance that issued it. Typically, a name service
would use a name token as a reference into its cache.

Unlike other service tokens, a name token does not represent a name that extends that of
the issuing name service. Rather, it represents a global name.

In the DASH kernel, the NAMING module uses the name token mechanism internally to
speed name resolution. This is done transparently to the clients of NAMING.

10.5.2. Scan Operations

Scan operations are used to read the set of entries in a name service’s directory. This
facility can be used to provide for resolution based on incomplete information, “wildcard”
queries, and so on.


Using  flags  passed  in  the  request  message,  the  operation  can  be  limited  to  a  subset  of  the 
entry  types.  Also  using  flags,  the  information  returned  for  each  entry  can  be  limited  to  its 
type. 


Since  the  number  of  entries  in  a  directory  may  be  large,  the  scan  operation  may  return  a 
subset  of  the  entries.  The  operation  can  specify  an  initial  entry  number,  and  the  client 
can  make  a  sequence  of  scan  operations  to  scan  the  entire  directory. 
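A client could scan an entire directory by repeating the scan operation with an advancing start index, as in the sketch below. The ScanFn hook stands in for one scan request/reply exchange; treating an empty reply as the end of the directory and numbering entries from zero are assumptions, since the text does not describe how the end is signalled.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one scan operation: returns the names of the
// entries starting at start_index, possibly fewer than remain.
typedef std::function<std::vector<std::string>(std::uint32_t)> ScanFn;

// Issue scan operations until one returns no entries, collecting the
// names of all entries in the directory.
std::vector<std::string> scan_all(const ScanFn& scan) {
    std::vector<std::string> all;
    std::uint32_t start_index = 0;
    while (true) {
        const std::vector<std::string> batch = scan(start_index);
        if (batch.empty()) {
            break;  // assumed end-of-directory signal
        }
        all.insert(all.end(), batch.begin(), batch.end());
        start_index += static_cast<std::uint32_t>(batch.size());
    }
    return all;
}
```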

The following messages are used to scan directories:

msg_typedef struct {
    U32  start_index;
    flags {
        services,       // return entries of type SERVICE
        name_services,  // return entries of type NAME_SERVICE
        hosts,          // return entries of type HOST
        owners,         // return entries of type OWNER
        return_types,   // return type of each entry
        return_names,   // return name of each entry
        return_attrs    // return complete attributes
    } flags;
} SCAN_REQUEST;

msg_typedef struct {
    SCAN_STATUS  status;       // see below
    U32          num_entries;  // number of entries in reply
    SCAN_ENTRY   entries[];    // array of information
} SCAN_REPLY;

msg_typedef struct {
    BYTES  name;
    union {
        NS_ENTRY       ns_entry;  // complete attributes requested
        NS_ENTRY_TYPE  type;      // only entry type requested
        NULL;                     // only name was requested
    };
} SCAN_ENTRY;

msg_typedef enum {
    OK,               // scan was successful
    NO_AUTHORIZATION  // did not have correct authorization
} SCAN_STATUS;


10.5.3.  Avoiding  Infinite  Loops  in  Name  Resolution 

The attributes of a general service include 1) a list of (host name, instance ID) pairs for
each instance of the service, and 2) the name of the owner of the service. The attributes
for a name service include those of a general service, but also include the attributes of the
hosts where there are instances of the service, the attributes of these hosts’ owners, and
the attributes of the owner of the name service.

To see why it is necessary to include this extra information, consider the following
scenario. Assume that the name service /edu/davis has only one instance, which
runs on the host /edu/davis/host1. A client on a host whose parent name service
is /edu tries to resolve the name /edu/davis/host1. The host asks /edu
to resolve /edu/davis/host1. /edu has an entry for /edu/davis. If /edu
returned only the service attributes for davis (e.g., a list of host names and instance
IDs), the next step in the resolution process would be to resolve the name
/edu/davis/host1, which is the original request. The extra information outlined
above solves this problem.
APPENDIX I - THE DASH MESSAGE LANGUAGE


1.  Introduction 

Local  and  remote  interprocess  communication  in  DASH  is  based  on  messages.  The 
transport  mechanism  views  a  message  as  a  logical  array  of  bytes.  For  the  sending  and 
receiving  clients,  however,  a  message  (or  a  portion  of  a  message)  may  be  interpreted  as  a 
collection  of  typed  data  items.  A  structured  message  is  a  (logically)  contiguous  portion 
of  a  DASH  message  that  possesses  a  well-defined  structure,  or  type. 

This  section  describes  the  DASH  facilities  for  structured  messages.  Specifically,  it 
includes: 

•  The definition of a DASH Message Language (DML) for the specification of
message structure. DML is not a general-purpose type definition facility, but rather
is tailored to the limited needs of kernel-level clients.

•  A  message  representation  standard  that  dictates  how  messages  with  DML 
specifications  are  to  be  represented. 

•  A  description  of  a  preprocessor  that  converts  a  limited  set  of  DML  type  definitions 
into  a  set  of  macros  that  facilitate  building  and  accessing  messages  of  those  types. 

This  facility  is  related  to  components  of  RPC  systems  such  as  those  of  Mesa  [5],  Mach 
[12]  and  Sun  UNIX  [13].  The  main  difference  between  DML  and  these  systems  is  that 
DML  clients  (message  producers  and  consumers)  are  expected  to  access  messages 
directly,  rather  than  serialize  and  deserialize  them.  This  is  because  the  clients  are  DASH 
kernels,  which  demand  efficiency  rather  than  a  transparent  programming  interface. 

2.  Message  Type  Definitions 

DML is used to define named message types. The definition of a new type has the form

    msg_typedef type-definition type-name;

where  type-definition  is  either  the  name  of  an  existing  type  (such  as  the  base  types  listed 
below)  or  one  of  the  various  type  constructors  defined  in  subsequent  sections.  A  type 
name  (or  other  user-assigned  name  in  DML)  may  be  any  valid  C  identifier  provided  it 
does  not  contain  two  consecutive  underscores  and  does  not  conflict  with  DML’s 
keywords. 

The following sections describe DML's base types and type constructors, and their
associated representations as messages.

3.  Base  Types 

The  following  set  of  base  types  is  predefined: 

enum {name1, name2, ...}
flags {name1, name2, ...}

U32,  U64 
S32,  S64 
BYTES 
NULL 
Enum is similar to the enum type in C. Flags denotes a set of up to 32 Boolean flags.
U32 and U64 denote 4- and 8-byte unsigned integers, and S32 and S64 denote
signed integers of the same sizes. BYTES denotes a variable-length array of bytes.
NULL denotes an empty message; it is typically used as a union element.

The  base  types  are  represented  as  follows.  The  U  and  S  types  have  length  4  or  8  bytes 
as  appropriate.  For  network  messages,  the  bytes  are  stored  in  network  byte  order  (the 
highest  order  byte  has  the  lowest  address).  The  enum  type  is  represented  as  a  32-bit 
unsigned  integer,  with  values  beginning  from  zero.  The  flags  type  is  represented  as  a 
4-byte  word;  bits  are  used  from  low-address  to  high-address  byte,  and  from  low  to  high- 
order  bits  within  a  byte.  The  BYTES  type  is  represented  as  a  32-bit  count  and  the  data 
bytes,  padded  to  a  4-byte  boundary. 
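The representation rules above can be illustrated with a small C sketch. It is not taken from the DASH sources: the helper names (put_u32_net, pad4, put_bytes, set_flag) are hypothetical, the count stored for a BYTES value is assumed to be the unpadded length, and the padding is zero-filled only for convenience (the text leaves padding bytes unspecified).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value in network byte order (highest-order byte first). */
static void put_u32_net(unsigned char *dst, uint32_t v)
{
    dst[0] = (unsigned char)(v >> 24);
    dst[1] = (unsigned char)(v >> 16);
    dst[2] = (unsigned char)(v >> 8);
    dst[3] = (unsigned char)v;
}

/* Round a length up to the next multiple of 4. */
static uint32_t pad4(uint32_t n)
{
    return (n + 3u) & ~3u;
}

/*
 * Encode a standalone BYTES value: a 32-bit count followed by the data,
 * padded to a 4-byte boundary. Returns the number of bytes written.
 * Assumption of this sketch: the count is the unpadded length.
 */
static size_t put_bytes(unsigned char *dst, const void *data, uint32_t len)
{
    uint32_t padded = pad4(len);

    assert(dst != NULL && (data != NULL || len == 0));
    put_u32_net(dst, len);
    memset(dst + 4, 0, padded);          /* zero padding: sketch's choice */
    if (len > 0)
        memcpy(dst + 4, data, len);
    return 4u + padded;
}

/*
 * Set flag number k in a 4-byte flags word: bits are used from the
 * low-address byte to the high-address byte, and from the low-order
 * bit to the high-order bit within each byte.
 */
static void set_flag(unsigned char word[4], unsigned k)
{
    assert(k < 32);
    word[k / 8] |= (unsigned char)(1u << (k % 8));
}
```

For instance, a 5-byte BYTES value occupies 12 bytes under these assumptions: 4 for the count and 8 for the padded data.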

4.  Structures 

DML  structs  are  a  restricted  form  of  structures  (i.e.,  catenated  labeled  subtypes).  The 
restrictions  make  it  simpler  to  generate  macros  for  building  and  accessing  the  structs. 
The  facility  is  typically  used  to  define  the  portions  of  messages  (such  as  headers  and 
trailers)  that  are  accessed  by  a  particular  protocol  level. 

A  struct  is  defined  as  follows: 

struct {
    type1  label1;
    type2  label2;
}

Each  type  is  either  a  base  type  or  a  previously  defined  struct  type. 

A  struct  is  a  mixed  sequence  of  fixed-size  fields  and  variable-length  byte  arrays.  The 
representation  (as  a  byte-array  message)  is  as  follows: 

•  If  there  are  any  BYTES  fields,  the  first  word  of  the  message  is  the  total  length  of 
the  message. 

•  The fixed-size fields are placed together, in the order of their declaration, at the start
of the message (after the total length word, if present).

•  The fixed-size fields are followed by a vector of (offset, length) pairs for each of the
BYTES fields, except for the first one (whose offset and length are implied by other
information). The offset field is the distance in bytes from the start of the
message to the beginning of the real bytes field; the length field is the length of
the byte field (including possible padding).

•  Each  BYTES  field  is  represented  by  the  data  bytes,  padded  (by  unspecified  bytes) 
to  a  4-byte  boundary. 

Hence  a  struct  with  n  fixed-size  fields  and  m  BYTES  fields  is  represented  as  in  Figure  1. 

Figure 1: Internal Representation of a Structured Message.

5.  Preprocessor-Generated Macros

We will specify the output of the macro preprocessor by giving an example. Consider
the following type definitions:

    expand REQ_MSG;

    msg_typedef U32 SEQ_NO;

    msg_typedef enum {OK, FAILURE} STATUS;

    msg_typedef flags {SECURE, DEADLINE} MSG_FLAGS;

    msg_typedef struct {
        SEQ_NO     seq_no;
        MSG_FLAGS  msg_flags;
        BYTES      name;
    } MSG_HEADER;

    msg_typedef struct {
        MSG_HEADER  header;
        enum {EXACTLY_ONCE, MAYBE}  mode;
        BYTES       info;
    } REQ_MSG;

The  expand  declaration  specifies  the  message  types  for  which  macros  are  to  be 
generated  (in  this  example,  only  REQ_MSG  will  be  expanded). 

The representation of a REQ_MSG message is as shown in Figure 2 (where n is the
length of the name field, rounded up to a multiple of 4).

Figure 2: The Representation of a REQ_MSG Message.

The DML preprocessor is given an input file, and a flag indicating whether the target
machine uses network byte order. If it is run with the above definitions as input, it will
generate the following macros.

5.1.  Enum and Flag Definitions

The  following  macros  define  enum  and  flag  values: 


    #define STATUS__OK                    0
    #define STATUS__FAILURE               1
    #define MSG_FLAGS__SECURE             1
    #define MSG_FLAGS__DEADLINE           2
    #define REQ_MSG__mode__EXACTLY_ONCE   0
    #define REQ_MSG__mode__MAYBE          1

5.2.  Size  Computation 

Both  the  sender  and  the  receiver  of  a  message  need  to  compute  the  total  length  of  the 
message  they  are  handling.  This  length  depends  on  the  size  of  the  fixed-length  fields 
(which  can  be  calculated  by  the  DML  preprocessor)  and  on  the  variable  length  BYTES 
fields.  If  there  are  no  variable-length  fields,  the  total  size  of  the  message  is  a  constant.  If 
there are BYTES fields, DML generates a macro

    #define REQ_MSG__size(i, j)  (28 + mult4(i) + mult4(j))

used by the sender to determine the size of the message as a function of the lengths of the
BYTES fields, and a macro

    #define REQ_MSG__getsize(d)  *((U32 *) (d))

used  by  the  receiver  to  get  the  total  length  stored  in  the  message  itself. 

REQ_MSG__size computes the total size of a message instance, given values for the
lengths of all BYTES fields. It uses a predefined macro

    #define mult4(i)  ((i+3) & 0xfffffffc)

which computes the smallest multiple of 4 at least as large as its argument.

The argument of REQ_MSG__getsize is a pointer to the data part of a MESSAGE
object. This macro returns the size of the REQ_MSG instance occurring in the
message. The message need not be contiguous, and is assumed to be in host byte order.

If there are no BYTES fields in the message, then the total length is a constant. In this
case the two macros above expand to a constant value.
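As a worked example of the size computation, the sketch below evaluates the sender-side macro for a 5-byte name and a 10-byte info field. The generated names are written with double underscores, and the argument of mult4 is parenthesized for safety; both are choices of this sketch. The constant 28 appears to be the sum of seven 4-byte words (total length, seq_no, msg_flags, mode, the offset and length of info, and the length of name), judging from the offsets used by the other macros in this section. The function example_size is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t U32;

/* Macros as given in the text for the REQ_MSG example. */
#define mult4(i)            (((i) + 3) & 0xfffffffc)
#define REQ_MSG__size(i, j) (28 + mult4(i) + mult4(j))

/* Worked example: a 5-byte name and a 10-byte info field. */
static U32 example_size(void)
{
    U32 name_len = 5;
    U32 info_len = 10;

    /* 28 bookkeeping/fixed bytes + 8 (name, padded) + 12 (info, padded) */
    U32 total = REQ_MSG__size(name_len, info_len);

    assert(total % 4 == 0);   /* every message is a whole number of words */
    return total;
}
```

With these lengths the message occupies 48 bytes, of which 5 are padding.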

5.3.  Initialization  of  Structural  Information 

This  macro  initializes  the  “bookkeeping”  fields  of  a  message  (total  length  of  the 
message  and  lengths  and  offsets  of  BYTES  fields),  given  the  lengths  of  its  BYTES 
entries.  If  there  are  no  BYTES  entries,  this  macro  is  empty. 

    #define REQ_MSG__format(p, i, j) \
        *(U32 *) (p) = 28 + mult4(i) + mult4(j); \
        *(U32 *) ((p) + 24) = (i); \
        *(U32 *) ((p) + 16) = 28 + mult4(i); \
        *(U32 *) ((p) + 20) = (j);

5.4.  Message  Field  Access 

Given  a  contiguous  message  with  bookkeeping  data  already  in  place,  the  following 
macros  compute  the  address  of  the  named  field,  and  recast  it  to  the  proper  type. 
    #define REQ_MSG__header__seq_no(p)     ((U32 *) ((p) + 4))
    #define REQ_MSG__header__msg_flags(p)  ((U32 *) ((p) + 8))
    #define REQ_MSG__header__name(p)       ((char *) ((p) + 28))
    #define REQ_MSG__header__mode(p)       ((U32 *) ((p) + 12))
    #define REQ_MSG__header__info(p)       ((char *) ((p) + *((U32 *) ((p)+16))))


If  REQ_MSG  had  contained  structs  having  fixed  size  (that  is,  without  any  BYTES 
fields),  then  macros  for  accessing  those  structs  would  have  been  generated,  too  (with  no 
recasting  of  pointers,  though).  This  kind  of  macro  is  useful,  for  example,  when  the 
programmer  needs  to  pass  the  address  of  a  struct  as  a  parameter  in  a  function  call. 


5.5.  BYTES  Field  Lengths 

The  following  macros  compute  the  actual  length  of  each  of  the  BYTES  fields  contained 
in  the  message. 

    #define REQ_MSG__header__name__length(p)  (*(U32 *) ((p) + 24))
    #define REQ_MSG__header__info__length(p)  (*(U32 *) ((p) + 20))

These values, retrieved from the bookkeeping information of the message itself, give the
actual space used by those BYTES fields; that is, they do not include the rounding to the
next multiple of 4.
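The difference between the stored length and the space actually occupied can be seen on the receiving side. The following sketch lays out a sample REQ_MSG by hand, using the offsets that the macros in this section imply, and then reads it back through the access and length macros. The functions make_sample and name_slot_size and the static buffer are hypothetical; the message is kept in host byte order, and the buffer is declared as an array of 32-bit words so that the casts to U32 * are properly aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t U32;

/* Macros as listed in the text (names follow the source). */
#define mult4(i)                         (((i) + 3) & 0xfffffffc)
#define REQ_MSG__header__name(p)         ((char *) ((p) + 28))
#define REQ_MSG__header__info(p)         ((char *) ((p) + *((U32 *) ((p) + 16))))
#define REQ_MSG__header__name__length(p) (*(U32 *) ((p) + 24))
#define REQ_MSG__header__info__length(p) (*(U32 *) ((p) + 20))

/* Word-aligned storage, so that the U32 casts above are well defined. */
static U32 storage[16];

/* Lay out a message by hand with name "abcde" and info "xyz". */
static char *make_sample(void)
{
    char *p = (char *) storage;

    memset(storage, 0, sizeof storage);
    storage[0] = 28 + mult4(5) + mult4(3);   /* total length: 40     */
    storage[4] = 28 + mult4(5);              /* offset of info: 36   */
    storage[5] = 3;                          /* actual info length   */
    storage[6] = 5;                          /* actual name length   */
    memcpy(p + 28, "abcde", 5);
    memcpy(p + 36, "xyz", 3);
    return p;
}

/* Bytes between the start of name and the start of info, padding included. */
static U32 name_slot_size(char *p)
{
    U32 used = REQ_MSG__header__name__length(p);
    U32 slot = (U32) (REQ_MSG__header__info(p) - REQ_MSG__header__name(p));

    assert(slot == mult4(used));
    return slot;
}
```

Here the name field reports a length of 5 but occupies an 8-byte slot; a receiver that copies the field out should use the stored length, not the slot size.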

5.6.  Byte  Order  Conversion 

This macro converts a message instance to or from network byte order. Its argument is
of type char*, and points to the message, which must be contiguous. If the target
machine is big-endian (i.e., uses network byte order), the __byteorder macro is
empty. On a little-endian machine, the macro is:

    #define REQ_MSG__byteorder(p) \
        BYTE_SWAP(p); \
        BYTE_SWAP((p) + 4); \
        BYTE_SWAP((p) + 8); \
        BYTE_SWAP((p) + 12); \
        BYTE_SWAP((p) + 16); \
        BYTE_SWAP((p) + 20); \
        BYTE_SWAP((p) + 24);

This  macro  is  a  sequence  of  calls  to  the  predefined  macro  BYTE_SWAP,  which  converts 
a  32-bit  word  between  host  and  network  byte  order  (the  same  function  goes  both  ways). 
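The text does not give the definition of BYTE_SWAP. One possible definition for a little-endian host is sketched below; the helper byte_swap32 is hypothetical. Reversing the four bytes in place is its own inverse, which is consistent with the remark that the same macro serves both directions. Note that only the seven words at offsets 0 through 24 are swapped; the BYTES data is left untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reverse the four bytes at address q in place. */
static void byte_swap32(char *q)
{
    char t;

    assert(q != NULL);
    t = q[0]; q[0] = q[3]; q[3] = t;
    t = q[1]; q[1] = q[2]; q[2] = t;
}

/* One possible BYTE_SWAP for a little-endian host (assumption of this sketch). */
#define BYTE_SWAP(q) byte_swap32((char *) (q))
```

Because the offsets of the BYTES fields are themselves stored in the message, a receiver on a little-endian host would apply the byte order conversion before using any of the access macros.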

5.7.  Example 

The  following  operations  must  be  performed  in  order  to  build  a  message  using  DML: 

(1)  Compute its length using __size.

(2)  Initialize the bookkeeping fields using __format.

(3)  Fill in the data fields.

(4)  Immediately before sending a message to the network, and after receiving it from
the network, use __byteorder().
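The first three steps can be sketched in C as follows, using the REQ_MSG macros from the preceding sections (written with double underscores, and with the argument of mult4 parenthesized). The function build_req is hypothetical, and calloc stands in for whatever buffer allocation a kernel client would actually use. Step (4) is left to the caller, since the byte order macro depends on the target machine.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t U32;

/* Macros as given in the text for REQ_MSG (names follow the source). */
#define mult4(i)            (((i) + 3) & 0xfffffffc)
#define REQ_MSG__size(i, j) (28 + mult4(i) + mult4(j))
#define REQ_MSG__format(p, i, j) \
    *(U32 *) (p) = 28 + mult4(i) + mult4(j); \
    *(U32 *) ((p) + 24) = (i); \
    *(U32 *) ((p) + 16) = 28 + mult4(i); \
    *(U32 *) ((p) + 20) = (j);
#define REQ_MSG__header__seq_no(p)    ((U32 *) ((p) + 4))
#define REQ_MSG__header__msg_flags(p) ((U32 *) ((p) + 8))
#define REQ_MSG__header__name(p)      ((char *) ((p) + 28))
#define REQ_MSG__header__mode(p)      ((U32 *) ((p) + 12))
#define REQ_MSG__header__info(p)      ((char *) ((p) + *((U32 *) ((p) + 16))))
#define MSG_FLAGS__SECURE             1
#define REQ_MSG__mode__MAYBE          1

/*
 * Hypothetical helper: build a REQ_MSG in host byte order.
 * Returns a buffer obtained from calloc (caller frees), or NULL.
 */
static char *build_req(U32 seq, const char *name, U32 name_len,
                       const char *info, U32 info_len)
{
    U32 total;
    char *p;

    assert(name != NULL && info != NULL);

    total = REQ_MSG__size(name_len, info_len);        /* step (1) */
    p = calloc(1, total);
    if (p == NULL)
        return NULL;

    REQ_MSG__format(p, name_len, info_len)            /* step (2) */

    *REQ_MSG__header__seq_no(p) = seq;                /* step (3) */
    *REQ_MSG__header__msg_flags(p) = MSG_FLAGS__SECURE;
    *REQ_MSG__header__mode(p) = REQ_MSG__mode__MAYBE;
    memcpy(REQ_MSG__header__name(p), name, name_len);
    memcpy(REQ_MSG__header__info(p), info, info_len);

    /* Step (4), the byte order conversion, is applied just before sending. */
    return p;
}
```

The ordering matters: the info accessor reads the offset stored at byte 16, so the bookkeeping fields must be initialized before any data is copied in.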


6.  Other  Type  Constructors 

DML provides other type constructors. These are for documentation purposes only; they
are checked for syntactic correctness, but no macros are generated for these types. They
must appear at the end of a message definition.

6.1.  List 

A  list,  like  a  struct,  represents  a  set  of  named  subtypes.  A  list,  however,  places  no 
restrictions  on  these  subtypes.  It  is  defined  by: 

list {
    type1  label1;
    type2  label2;
}

A  list  is  represented  by  the  catenation  of  its  entries. 

6.2.  Union 

A union (like a C union) represents a choice of one type out of a set of types. It is
defined by

union {
    type1  label1;
    type2  label2;
}

The representation of a union is simply the representation of one of the subtypes. This
differs from the C language, where the representation of a union is large enough to
accommodate all subtypes.

6.3.  Arrays 

An  array  is  a  list  of  entries  of  a  single  (perhaps  variable-size)  type.  It  is  defined  by 

type [const] 
type  [] 

The  first  declares  a  fixed-length  array,  the  second  an  array  of  indeterminate  length.  An 
array  is  represented  by  the  catenation  of  its  elements. 


